AD-A111 296   NAVAL BIODYNAMICS LAB NEW ORLEANS LA   F/G 5/10

PERFORMANCE EVALUATION TESTS FOR ENVIRONMENTAL RESEARCH (PETER): COLLECTED PAPERS
JUL 81   R. S. KENNEDY, A. C. BITTNER, R. C. CARTER
UNCLASSIFIED   NBDL-80R008


NBDL - 80R008




PERFORMANCE  EVALUATION  TESTS  FOR  ENVIRONMENTAL  RESEARCH 
(PETER) :  COLLECTED  PAPERS 


JULY  1981 


DTIC 


NAVAL  BIODYNAMICS  LABORATORY 
New  Orleans,  Louisiana 




Approved  for  public  release.  Distribution  unlimited. 


Unclassified 


SECURITY CLASSIFICATION OF THIS PAGE (When Data Entered)


REPORT  DOCUMENTATION  PAGE 


1.  REPORT  NUMBER 

NBDL  -  80R008 


4. TITLE (and Subtitle)

Performance  Evaluation  Tests  for  Environmental 
Research  (PETER) :  Collected  Papers 


7. AUTHOR(s)

R. Kennedy, A. Bittner, Jr., R. Carter, M. Krause,
M. Harbeson, D. McCafferty, R. Pepper, and S. Wiker


9. PERFORMING ORGANIZATION NAME AND ADDRESS

Naval Biodynamics Laboratory

P.O. Box 29407

New Orleans, LA 70189


11. CONTROLLING OFFICE NAME AND ADDRESS

Naval  Medical  Research  &  Development  Command 
Bethesda,  MD  20014 


READ  INSTRUCTIONS 
BEFORE  COMPLETING  FORM 


3.  RECIPIENT'S  CATALOG  NUMBER 


5. TYPE OF REPORT & PERIOD COVERED

Research Report


6. PERFORMING ORG. REPORT NUMBER

NBDL - 80R008


8. CONTRACT OR GRANT NUMBER(s)


10. PROGRAM ELEMENT, PROJECT, TASK
AREA & WORK UNIT NUMBERS

Project  F58524 

Task  Area  ZF5852406 

Work Unit MF58.524-002-5027


12. REPORT DATE


July 1981


13.  NUMBER  OF  PAGES 

43 


14. MONITORING AGENCY NAME & ADDRESS (if different from Controlling Office)

15. SECURITY CLASS. (of this report)

Unclassified

15a. DECLASSIFICATION/DOWNGRADING
SCHEDULE


16. DISTRIBUTION STATEMENT (of this Report)

Approved for public release, distribution unlimited


17. DISTRIBUTION STATEMENT (of the abstract entered in Block 20, if different from Report)


19. KEY WORDS (Continue on reverse side if necessary and identify by block number)


Repeated  Measures 
Human  Performance  Testing 
Test  Battery 
PETER 


Memory 

Item  recognition 
Stroop

Code  substitution 


Digit  span 


20. ABSTRACT (Continue on reverse side if necessary and identify by block number)

This  is  a  collection  of  papers  about  the  ongoing  development  of  Performance 
Evaluation  Tests  for  Environmental  Research  (PETER) .  Environmental  Research 
involves  the  assessment  of  human  mental  and  physical  capabilities  in  unusual 
environments  (e.g.,  vibration,  ship  motion,  deep  sea  diving,  or  outer  space). 
Such  research  often  includes  repeated  measurement  of  the  capabilities  of  the 
same  subjects  before,  during,  and  after  exposure  to  an  unusual  environment. 
PETER  is  being  developed  specifically  for  repeated  measurement,  taking  account 
of  required  properties  of  test  means,  variances,  and  intertrial  correlations. 


DD FORM 1473


EDITION OF 1 NOV 65 IS OBSOLETE
S/N 0103-014-6601


Unclassified 

SECURITY CLASSIFICATION OF THIS PAGE (When Data Entered)


Unclassified


SECURITY CLASSIFICATION OF THIS PAGE (When Data Entered)


(20  ABSTRACT) 

Candidate tests for PETER were suggested by the literature of performance
testing, as summarized in the first paper of this collection. The results
of the examinations of candidate tests with respect to the required properties
are summarized in the second paper of the collection. The remaining papers
deal with specific tests, including code substitution, Stroop, complex
counting, critical tracking, time estimation, arithmetic, air combat
maneuvering, digit span, four other memory tests, interference susceptibility, and
item recognition, that are discussed only briefly in the first two papers.

These tests were not selected for a particular purpose, such as measuring specific
attributes; rather, they were selected based on availability and demonstrated
usefulness. This collection of papers describes progress in the PETER project
up to November 1980.


Unclassified


SECURITY CLASSIFICATION OF THIS PAGE (When Data Entered)


NBDL - 80R008


PERFORMANCE  EVALUATION  TESTS  FOR  ENVIRONMENTAL  RESEARCH 
(PETER) :  COLLECTED  PAPERS 


Robert  S.  Kennedy,  Alvah  C.  Bittner,  Jr.,  Robert  C.  Carter,  Michele  Krause, 
Mary  M.  Harbeson,  Denise  B.  McCafferty,  Ross  L.  Pepper,  Steven  F.  Wiker 


July  1981 


Bureau  of  Medicine  and  Surgery 
Work Unit MF58.524-002-5027


Approved by

Channing L. Ewing, M.D.
Scientific Director


Released by

Captain J. E. Wenger, MC, USN
Commanding Officer


Naval  Biodynamics  Laboratory 
Box  29407 

New  Orleans,  LA  70189 

Opinions or conclusions contained in this report are those of the author(s) and do not
necessarily reflect the views or the endorsement of the Department of the Navy.

Approved for public release; distribution unlimited.

Reproduction in whole or part is permitted for any purpose of the United
States Government.


Summary  Page 


PROBLEM 

The  effectiveness  of  many  man-machine  systems  is  limited  by  the 
performance  of  the  human  component.  Environmental  stressors,  such  as 
ship  motion  or  vibration,  are  a  major  factor  affecting  human  performance. 
Hence,  it  is  important  to  know  the  degree  to  which  performance  capability 
is  altered  by  environmental  stressors  encountered  during  operation  of  a 
man-machine  system.  Human  performance  capability  can  be  assessed  by 
comparing  performance  in  a  standard  environment  with  performance  in  a 
stressful  environment  of  interest.  The  comparison  involves  repeated 
measurement of the same subjects in both environments, but not all
performance tests are suitable for repeated measurement.


FINDINGS 


1.  Suitability  of  tests  for  repeated  measurement  can  be  represented 
by  the  means,  variances,  and  intertrial  correlations  of  test  scores  obtained 
from  several  measurements  of  the  same  subjects  in  a  standard  environment. 

2.  Tests  become  more  suitable  for  repeated  measurement  after  practice 
by  the  subjects.  The  required  amount  of  practice  varies  from  one  test  to 
another. 


RECOMMENDATIONS 

Tests  that  are  to  be  used  for  repeated  measurement  should  be  practiced 
by  the  subjects  prior  to  being  used  to  obtain  data.  The  required  amount 
of  practice  should  be  determined  from  data  obtained  in  a  standard  environment. 


This research work was funded by the Naval Medical Research and Development
Command and by the Biological Sciences Division of the Office of Naval
Research.

The volunteers used in this study were recruited, evaluated, and employed
in accordance with the procedures specified in the Secretary of the Navy
Instruction 3900.39 series and the Bureau of Medicine and Surgery Instruction 3900.6
series. These instructions are based upon voluntary consent, and meet or exceed
the prevailing national and international guidelines.

Trade  names  of  materials  or  products  of  commercial  or  non-government 
organizations  are  cited  where  essential  for  precision  in  describing  research 
procedures  or  evaluation  of  results.  Their  use  does  not  constitute  official 
endorsement  or  approval  of  the  use  of  such  commercial  hardware  or  software. 


Table  of  Contents 


Selection of Performance Evaluation Tests
for Environmental Research
   by R. C. Carter, R. S. Kennedy, and A. C. Bittner, Jr.

A Catalogue of Performance Evaluation Tests
for Environmental Research
   by R. S. Kennedy, R. C. Carter, and A. C. Bittner, Jr.

Performance Evaluation Tests for Environmental
Research (PETER): Code Substitution Test
   by R. L. Pepper, R. S. Kennedy, A. C. Bittner, Jr.,
   and S. F. Wiker

A Comparison of the Stroop Test to Other Tasks for
Studies of Environmental Stress
   by M. M. Harbeson, R. S. Kennedy, and A. C. Bittner, Jr.

Performance Evaluation Tests for Environmental Research
(PETER): Auditory Digit Span
   by D. B. McCafferty, A. C. Bittner, Jr., and R. C. Carter

Comparison of Memory Tests for Environmental Research
   by M. M. Harbeson, M. Krause, and R. S. Kennedy

Performance Evaluation Tests for Environmental Research
(PETER): Interference Susceptibility Test (IST)
   by M. Krause and R. S. Kennedy

Item Recognition as a Performance Evaluation Test for
Environmental Research
   by R. C. Carter, R. S. Kennedy, A. C. Bittner, Jr.,
   and M. Krause


Each  of  these  papers  was  presented  at  a  professional  meeting  or  symposium. 
Acknowledgement  of  previous  publication  appears  at  the  beginning  of  each 
paper. 




PROCEEDINGS  OF  THE  24TH  ANNUAL  MEETING  OF  THE  HUMAN  FACTORS  SOCIETY 
LOS  ANGELES,  CA,  13-17  OCTOBER  1980 

SELECTION  OF  PERFORMANCE  EVALUATION  TESTS  FOR  ENVIRONMENTAL  RESEARCH* 

Robert  C.  Carter,  Robert  S.  Kennedy,  and  Alvah  C.  Bittner,  Jr. 

Naval  Biodynamics  Laboratory,  New  Orleans,  LA  70189 

ABSTRACT 

A battery of Performance Evaluation Tests for Environmental Research (PETER) that is suitable for
use in repeated measures experiments is being developed at the Naval Biodynamics Laboratory. This
paper describes the sources of tasks which have been considered for inclusion in PETER. It also lists
the  tests  in  the  source  batteries  which  have  or  have  not  yet  been  considered  for  inclusion  in  PETER. 
The  performance  content  of  the  tests  that  have  been  considered  is  compared  with  the  content  of  those 
that  have  not.  Recommendations  are  made  for  selection  of  additional  tests  from  the  source  batteries 
which  will  not  be  redundant  with  tests  that  already  have  been  considered.  This  report  puts  PETER  into 
the  context  of  the  tests  and  test  batteries  which  came  before  it. 


INTRODUCTION 

The Naval Biodynamics Laboratory is engaged in the study of various
measures of human performance in order to select Performance Evaluation
Tests for Environmental Research (PETER) (Kennedy & Bittner, 1977;
Kennedy, Bittner, & Harbeson, 1980). Several criteria have been used to
choose the candidate tests. Prospective PETER tasks must have been shown
to be diagnostic of brain damage, or to be sensitive to environmental
stressors, or to measure some aspect of human information processing.
Further, the test materials were required to be statistically suitable
for repeated measurement of subjects' performance before, during, and
after experiencing an unusual environment. In order to evaluate the
suitability of tests, means, between-subject standard deviations, and
cross-session reliabilities were obtained from 15 days of repeated
measures in a standard environment. After a reasonable amount of
practice, the means, standard deviations, and reliabilities must have
been approximately constant across sessions. Constant means in a
standard environment are preferred if changes due to an unusual
environment are to be interpretable (Campbell & Stanley, 1966), although
linearly increasing means are also acceptable. Constant standard
deviations and cross-session reliabilities are sufficient to meet some
assumptions of repeated measures ANOVA (Winer, 1971), which is often
employed to analyze environmental experiments. These, then, were the
criteria for suitability of a test for assessment of performance in
exotic environments. The purposes of this report are: (1) to show the
sources of tests which have been considered for PETER; and (2) to evolve
plans for the selection of additional tests.
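The suitability criteria described above can be illustrated with a short sketch. It is not the authors' analysis procedure: the score matrix is invented, the helper names (`session_summary`, `roughly_constant`) are hypothetical, and the 10 percent tolerance used to call a statistic "approximately constant" is an assumption made only for illustration.

```python
from statistics import mean, stdev


def pearson(x, y):
    """Pearson correlation between two equal-length score sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def session_summary(scores):
    """scores[i][j] is the score of subject i in session j.

    Returns per-session means, between-subject standard deviations,
    and the matrix of cross-session (intertrial) correlations.
    """
    sessions = list(zip(*scores))  # one tuple of subject scores per session
    means = [mean(s) for s in sessions]
    sds = [stdev(s) for s in sessions]
    n = len(sessions)
    corr = [[pearson(sessions[a], sessions[b]) for b in range(n)]
            for a in range(n)]
    return means, sds, corr


def roughly_constant(values, rel_tol=0.10):
    """True if the range of values is within rel_tol of their mean.

    The 10 percent tolerance is an illustrative assumption, not a
    criterion taken from the report.
    """
    return (max(values) - min(values)) <= rel_tol * abs(mean(values))


# Invented data: 4 subjects measured in 3 sessions of a standard environment.
scores = [
    [10, 12, 12],
    [14, 16, 16],
    [8, 10, 10],
    [12, 14, 14],
]
means, sds, corr = session_summary(scores)
print(means)                         # per-session means
print(roughly_constant(means))       # all sessions, including early practice
print(roughly_constant(means[1:]))   # sessions after the first (practice) one
print(roughly_constant(sds))         # between-subject standard deviations
```

In this invented example the mean rises between the first and second sessions and is flat afterward, which is the pattern the criteria are meant to detect: data collected after the practice session would be treated as stable, while the first session would not.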

METHOD 

Candidate tests for PETER have been selected mainly from other
performance test batteries because of the intellectual and financial
investment in those batteries and the need for use of standard
procedures. The sources from which tests have been adopted for PETER
include: Wechsler (1958); Ekstrom, French, Harman, and Dermen (1976);
Fleishman and Ellison (1962); Rose (1974, 1978); Reitan and Davison
(1974); Bennett (1979); Underwood, Boruch, and Malmi (1977); video
games; and other miscellaneous sources.


*This research was performed under Navy Work Unit No.
MF58.524-002-5027. The opinions are those of the authors and do not
necessarily reflect those of the Department of the Navy.

Many tasks within these batteries have not yet been considered for
inclusion in PETER. In some cases, tasks were not adaptable for repeated
measurement. For example, it would be almost impossible to generate many
comparable forms of an information test (e.g., Wechsler, 1958). Numerous
tests have not been examined because of necessary compromises involving
resources available and judgements of the importance and uniqueness of
test content.

Other batteries have not been studied for various reasons. For instance,
Fleishman's (1964) tests of physical fitness have not been investigated
because their scores are likely to change radically with repeated
measurement. The extensive research of Alluisi (e.g., 1966) and others
on synthetic work is not yet reflected in PETER because of the need to
demonstrate suitability of component tasks before combining the tasks.
In addition, batteries intended primarily for selection or training
evaluation were not used because they are usually proprietary, and
because they are more likely to measure success or achievement than
performance. Finally, some performance batteries may have been
unintentionally overlooked.

A tabular approach was employed to compare PETER with the source
batteries and to aid selection of new tests for possible inclusion in
PETER. One table was constructed for each source battery. The tables
give the names of the tests in the battery and the performance functions
measured by those tests. The tests listed in each table are classified
as having been considered for inclusion in PETER or not. Hence, the
tables fulfilled our first objective by showing the overlap between
PETER and other test batteries. The second objective, selection of
additional tests for PETER, was met by examining the tests which have
not yet been considered for inclusion in PETER. Those tests which
measure content not now represented in PETER were recommended for
consideration.


Tasks drawn from the Ekstrom et al. (1976) battery or its predecessors
(e.g., Moran, Kimble, & Mefferd, 1964) are listed under this reference.
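As a rough illustration of the tabular approach described above, the sketch below represents one source battery as a list of entries and reports the content measured only by tests that have not yet been considered. The `Entry` type and `uncovered_content` function are hypothetical names, and the content labels are abbreviated from Table 1 for the example; the authors worked from printed tables, not code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """One row of a source-battery table (hypothetical structure)."""
    name: str
    content: frozenset   # performance functions the test is said to measure
    considered: bool     # True if already considered for PETER


def uncovered_content(battery):
    """Map each not-yet-considered test to content no considered test covers."""
    covered = set()
    for entry in battery:
        if entry.considered:
            covered |= entry.content
    gaps = {}
    for entry in battery:
        if not entry.considered:
            missing = entry.content - covered
            if missing:
                gaps[entry.name] = sorted(missing)
    return gaps


# A few rows abbreviated from Table 1 (Wechsler, 1958), for illustration only.
wechsler = [
    Entry("Arithmetic", frozenset({"arithmetic processes"}), True),
    Entry("Digit Span",
          frozenset({"retentiveness", "attention", "concentration"}), True),
    Entry("Information", frozenset({"range of information"}), False),
    Entry("Picture Completion",
          frozenset({"visual imagery", "concentration"}), False),
]

gaps = uncovered_content(wechsler)
for test, missing in gaps.items():
    print(test, "->", missing)
```

Tests whose content is already covered by considered tests drop out of the result, which mirrors the stated goal of recommending only tests that would not be redundant.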

RESULTS 

Table  1  displays  the  tests  in  the  Wechsler 
(1958)  Adult  Intelligence  Scale,  which  is  intended 
to  measure  ability  to  think  rationally,  to  act 
purposefully,  and  to  deal  effectively  with  the 
environment.  Three  of  the  11  tests  have  been 
entertained  for  inclusion  in  PETER  (Arithmetic, 
Digit  Span,  and  Code  Substitution).  It  is  obvious 
that  they  were  chosen  because  alternate  forms  are 
relatively  easy  to  generate.  The  remaining  8 
tests,  which  have  not  yet  been  considered,  measure 
range  of  experience  (Information  and  Vocabulary), 
and  ability  to  analyze  and  synthesize  complex 
situations  (Picture  Completion,  Comprehension, 
Similarities,  Block  Design,  Picture  Arrangement, 
and  Object  Assembly).  It  is  apparent  that  we  have 
reviewed  the  atomistic  elements  of  the  Wechsler 
battery,  and  have  not  examined  the  molar  elements. 
Furthermore,  we  have  considered  the  symbolic  tests 
and  not  the  verbal  and  pictorial  tests. 

Table  2  shows  the  tests  in  the  Ekstrom, 
French,  Harman,  and  Dermen  (1976)  battery,  some  of 
which  have  been  offered  in  20  alternate  forms  by 
Moran,  Kimble,  and  Mefferd  (1964).  The  purpose  of 
this  factor-analytic  battery  of  72  cognitive  tests 
is  to  provide  research  workers  with  a  23-factor 
reference  system  for  comparison  of  studies  on 
mental  abilities.  Table  2  shows  that  9  of  the  23 
factors  are  represented  by  tests  that  have  been 
considered  for  inclusion  in  PETER.  However,  14 
factors  have  not  been  represented  in  PETER  by 
tests  from  Ekstrom,  French,  Harman,  and  Dermen 
(1976). The factors which are not represented in
PETER by these tests have to do with: identifying
visual configurations in noise (Speed of Closure),
4 Fluency factors that relate to rapidity of producing
non-repetitive but related responses (e.g.,
list things that are red), Reasoning (Inductive,
Logical, and General), Memory (Associative and
Visual), Visualization of objects assembled by
rotation of their parts, Flexibility (Figural and
Use, e.g., list unusual uses for a given common
object), and Verbal Comprehension (e.g., Vocabulary).

Table 3 lists tests of manual dexterity analyzed by Fleishman and
Ellison (1962). They show that their battery of 21 tests can be
represented by 5 meaningful factors: Wrist-Finger Speed, Finger
Dexterity, Speed of Arm Movement, Manual Dexterity, and Aiming. Three of
these 5 factors are represented by tests that have been considered for
inclusion in PETER, although Wrist-Finger Speed and Aiming are both
represented only by a tapping test. Better measures of each of these two
factors are suggested by Fleishman and Ellison (1962). Factors which are
not represented in PETER are Finger Dexterity and Speed of Arm Movement.


Table 4 reviews tests suggested by Rose (1974, 1978) as representative
of human information processing. All of these tests have been considered
for inclusion in PETER because of their construct validity and because
Rose (1974, 1978) has suggested how to produce alternate forms.

Table  5  recounts  the  tests  of  the  Halstead- 
Reitan  batteries  described  by  Reitan  and  Davison 
(1974).  Only  1  test  from  this  battery  has  been 
considered  for  inclusion  in  PETER.  The  purpose 
of  these  tests,  as  applied  by  Halstead  and  Reitan, 
is  to  provide  a  basis  from  which  inferences  may 
be  made  regarding  the  organic  integrity  of  the 
brain. Most of the tests have been shown to be
sensitive to brain damage (Reitan & Davison,
1974). It seems unlikely, however, that some of
these  (e.g.  the  aphasia  screening  test)  would  be 
sensitive  to  the  range  of  variability  encountered 
in  normal  subjects.  Other  tests  in  the  battery 
(e.g.  Critical  Flicker  Frequency,  and  Lateral 
Dominance  Examination)  appear  to  have  little 
relation  to  the  work-related  abilities  at  which 
PETER  is  aimed.  However,  the  battery  offers  some 
unusual  tests  which  may  be  related  to  abilities 
that  are  occasionally  useful  (e.g.  Speech  Sounds 
Perception,  Rhythm,  Finger  Oscillation,  Steadiness, 
Ballistic  Arm  Tapping,  Orientation,  Sandpaper 
Test,  or  Tactile  Form  Recognition). 

Table 6 reveals the tests included in the Duke University Environmental
Battery (Bennett, 1979). This battery is of special interest because its
purpose is similar to that of PETER: to detect and identify changes in
human abilities caused by unusual environments. The battery described by
Bennett (1979) reflects a special interest in hyperbaric environments.
Most of the tests in the battery have been discussed in this paper in
connection with other batteries, although it includes a unique test of
intentional tremor which has not yet been considered for inclusion in
PETER.

Table 7 recalls the battery of 24 memory tests which was factor analyzed
by Underwood, Boruch, and Malmi (1977). The contents of this battery
should be well represented in PETER because memory plays a central role
in human performance. Underwood et al. found 5 meaningful factors that
described most of the variance in scores on their tests. The factors,
which tended to be related to the type of memory task rather than the
type of material being remembered, were: Paired Associates, Free Recall,
Memory Span, Recognition, and Discrimination. Tests of two of these
factors, Paired Associates and Discrimination, have not yet been
considered for inclusion in PETER.

Table 8 acknowledges that microcomputer-based video games have been
considered as performance tests for possible inclusion in PETER (e.g.,
Kennedy, Bittner, & Jones, 1980). This source of tests is so new that it
is difficult to compare its content with that of traditional tests.
However, we have found that the Air Combat Maneuvering game produces
scores that are highly correlated with scores from traditional
compensatory tracking. Furthermore, a recent factor analysis of five
video games (the first 5 in Table 8) indicates that they are spanned by
two factors represented by Air Combat Maneuvering and Slalom games
(Kennedy, Bittner, & Jones, 1980). Such games will continue to be
selected for consideration for inclusion in PETER.

Finally, Table 9 assembles some miscellaneous
performance tests which have been examined but are
not from an established battery. The Navigation
Plotting  test  was  selected  because  it  is  a  task 
which  is  vital  in  the  Naval  context  which  motivates 
the  development  of  PETER.  The  Landolt  C  test  of 
visual  acuity  is  the  only  sensory  function  test 
that  has  been  considered  for  inclusion  in  PETER. 
Time  estimation,  multiple  choice  reaction  time, 
and  tracking  (performed  singly  and  in  dual  modes) 
were  tried  because  of  their  prominence  in  the 
armamentarium  of  performance  measurement. 

DISCUSSION 

Where do we go from here? It is obvious that the tests that have been
considered for PETER are overwhelmingly representative of mental ability
(i.e., throughput) tests. We believe, however, that tests of input and
output capabilities should be included in PETER to supplement the tests
of mental mediation. Some tests of visual perception should be
considered, such as contrast sensitivity, dynamic visual acuity, color
discrimination, accommodation, visual field size, fusional reserve,
visual illusions, vection sensitivity, pattern recognition, and visual
search. Tests of auditory perception may also be worthwhile, such as
audiometry, the Rhythm test (Table 5), impedance audiometry, the Naval
Aviator's Speech Discrimination Test, and rhyming-word-list tests of
speech perception. Tests of cutaneous information processing suggested
by Table 5 may also be of interest. In addition to these tests of input
functions, some output tests may be of interest. For example, the most
rudimentary form of output is standing erect. Tests of standing
steadiness, postural tremor, and intentional tremor (such as the Ball
Bearing Test in Table 6 or the steadiness test in Table 5) are examples
of tests of fundamental output functions. Other tests of output were
suggested by Table 3, which dealt with Fleishman and Ellison's (1962)
manual performance factors. Tests of Finger Dexterity and Speed of Arm
Movement are needed. The latter factor was also suggested by Reitan and
Davison (1974), as represented in Table 5 (Ballistic Arm Tapping test).
Additional independent tests of Aiming and Wrist-Finger Speed also would
be prudent selections.


Paired Associates was found to be the most influential factor in the
Underwood et al. (1977) analysis of memory. This factor, which is also
reported by Ekstrom et al. (1976), is not represented by tests already
considered for PETER. Other memory factors (from Tables 7 and 2,
respectively) which are not represented by PETER tests are: (a)
Discrimination (e.g., given many pairs of words, one of which is
underlined in each pair, underline the appropriate word when one of the
pairs is presented again), and (b) Visual Memory (e.g., reproduce a
map).

Review  of  Table  2  showed  that  there  were 
several  families  of  cognitive  factors  that  had 
not  yet  been  considered  for  inclusion  in  PETER: 
Speed of Closure, Fluency, Reasoning, Visualization,
and Flexibility. These are important
determinants  of  human  performance,  and  tests 
representing  them  should  be  investigated  for 
inclusion  in  PETER. 

Tests of human information processing (Table 4) offered by Rose (1974,
1978) have been exhaustively studied for inclusion in PETER. No
additional tests of this type need be selected unless a new and
important information processing paradigm becomes available. However,
many of the tests already considered are ideally suited for
implementation in a computer-controlled form, which may vastly improve
the tests compared with the paper and pencil forms offered by Rose
(1974, 1978).

Video games should continue to be selected for possible inclusion in
PETER because they are adaptive, challenging, and interesting to
perform. Interest in the task is very important when repeated
measurements are to be made, as is common in environmental research.
Furthermore, the dynamic nature of video games enables them to tap
aspects of mental capability that are unavailable to paper and pencil
tests and seemingly well related to operational jobs.

The  global  measures  of  performance  offered 
by  Wechsler  (1958)  and  listed  in  Table  1  are 
largely  not  amenable  to  repeated  measurement  due 
to  the  difficulty  of  creating  good  alternate 
forms.  Some  of  the  performance  factors  measured 
by  these  tests  may  be  assessed  by  other  means. 
Range of experience could be represented by
biographical items, for example. Ability to analyze
and  synthesize  complex  situations  may  be  assessed 
with  complex  exercises  such  as  war  games. 

To summarize, the following additional types
of tests should be selected for possible inclusion
in PETER:

1. Visual Perception

2. Auditory Perception

3. Tactile Perception

4. Standing Steadiness & Tremor

5. Finger Dexterity

6. Speed of Arm Movement

7. Aiming

8. Wrist-Finger Speed

9. Paired-Associates Memory

10. Discrimination Memory

11. Visual Memory

12. Speed of Closure

13. Fluency (4 types)

14. Reasoning (3 types)

15. Visualization

16. Flexibility (2 types)

17. Video games

18. Complex games requiring Analysis and
Synthesis

Tests of these content areas are available in the
source batteries discussed in this report, but
such tests have not yet been considered for
inclusion in PETER. It is recommended that attention
be given to tests of these content areas.

REFERENCES 

Alluisi, E. A. Methodology in the use of synthetic tests to assess
complex performance. Human Factors, 1966, 9, 375-384.

Bennett, P. B. Personal communication, June 6, 1979.

Campbell, D. T., & Stanley, J. C. Experimental and quasi-experimental
designs for research. Chicago: Rand McNally, 1966.

Ekstrom, R. B., French, J. W., Harman, H. H., & Dermen, D. Manual for
kit of factor-referenced cognitive tests. Princeton, New Jersey:
Educational Testing Service, 1976.

Fleishman, E. A. The structure and measurement of physical fitness.
Englewood Cliffs, New Jersey: Prentice-Hall, 1964.

Fleishman, E. A., & Ellison, G. D. A factor analysis of fine
manipulative tests. Journal of Applied Psychology, 1962, 46, 96-105.

Kennedy, R. S., & Bittner, Jr., A. C. The development of a Navy
Performance Evaluation Test for Environmental Research (PETER). In L. T.
Pope & D. Meister (Eds.), Productivity Enhancement: Personnel
Performance Assessment in Navy Systems. Symposium presented at the Naval
Personnel Research and Development Center, San Diego, October 1977,
393-408. (NTIS No. AD A056047)

Kennedy, R. S., Bittner, Jr., A. C., & Harbeson, M. M. An engineering
approach to the standardization of Performance Evaluation Tests for
Environmental Research (PETER). Proceedings of the 11th Annual
Conference of the Environmental Design and Research Association (EDRA),
Charleston, SC, March 1980.

Kennedy, R. S., Bittner, Jr., A. C., & Jones, M. B. The utility of
available television-computer games for assessing performance and other
applications. Proceedings of the 51st Annual Scientific Meeting of the
Aerospace Medical Association, 1)3-64, May 1980.

Moran, L. J., Kimble, J. P., & Mefferd, R. R. Repetitive psychometric
measures: Equating alternate forms. Psychological Reports, 1964, 14,
335-338.

Reitan, R. M., & Davison, L. A. Clinical neuropsychology: Current status
and applications. New York: Halstead Press, 1974.

Robb, G. P., Bernardoni, L. C., & Johnson, R. W. Assessment of
Individual Mental Ability. New York: Intext Educational Publishers,
1972.

Rose, A. M. Human information processing: An assessment and research
battery. Technical Report No. 46. Ann Arbor, Michigan: University of
Michigan, January 1974.

Rose, A. M. An information processing approach to performance
assessment, AIR 58500-11/78-FR. Washington, D.C.: American Institutes
for Research, 1978.

Underwood, B. J., Boruch, R. F., & Malmi, R. A. The composition of
episodic memory. Evanston, Illinois: Northwestern University, 1977
(NTIS No. AD-040-696).

Wechsler, D. Measurement and appraisal of adult intelligence.
Baltimore: Williams & Wilkins, 1958.

Winer, B. J. Statistical principles in experimental design (2nd ed.).
New York: McGraw-Hill, 1971.

TABLE 1: WECHSLER (1958) TESTS AND PETER

TESTS                   CONTENT(a)

CONSIDERED FOR PETER
ARITHMETIC              ARITHMETIC PROCESSES
DIGIT SPAN              RETENTIVENESS, AUDITORY IMAGERY, ATTENTION, AND CONCENTRATION
CODE SUBSTITUTION       ROTE RECALL, VISUAL IMAGERY, SPEED AND ACCURACY IN LEARNING
                        AND WRITING SYMBOLS

NOT YET CONSIDERED
INFORMATION             RANGE OF INFORMATION, EXPERIENCE
VOCABULARY              RANGE OF IDEAS, CONCEPT FORMATION, LANGUAGE DEVELOPMENT
PICTURE COMPLETION      VISUAL IMAGERY, PERCEPTION AND ALERTNESS, CONCENTRATION
COMPREHENSION           SOCIAL JUDGEMENT, REASONING, ORGANIZATION AND APPLICATION
                        OF KNOWLEDGE
SIMILARITIES            VERBAL CONCEPTS, ABSTRACT THINKING
BLOCK DESIGN            FORM PERCEPTION, ANALYSIS AND SYNTHESIS
PICTURE ARRANGEMENT     ABILITY TO COMPREHEND A WHOLE SITUATION
OBJECT ASSEMBLY         VISUAL PERCEPTION AND SYNTHESIS, RECOGNITION OF PATTERNS

(a) Robb, Bernardoni, and Johnson (1972)




TABLE 2: EKSTROM, FRENCH, HARMAN, AND DERMEN (1976)
AND MORAN, KIMBLE, AND MEFFERD (1964) TESTS AND PETER

CONSIDERED FOR PETER

  TESTS:    HIDDEN WORDS; WORD BEGINNINGS; CALENDAR TEST; AUDITORY DIGIT SPAN;
            ARITHMETIC OPERATIONS(a); ADDITIONS; FINDING A's, NUMBER COMPARISON,
            NUMBER CROSS OUT(a); LETTER ROTATION; [illegible]

  CONTENT:  FLEXIBILITY OF CLOSURE; VERBAL CLOSURE; WORD FLUENCY; INTEGRATIVE
            PROCESSES; MEMORY SPAN; NUMBER FACILITY; PERCEPTUAL SPEED; SPATIAL
            ORIENTATION; SPATIAL SCANNING

NOT YET CONSIDERED(b)

  TESTS:    SNOWY PICTURES; OPPOSITES; MAKING SENTENCES; ORNAMENTATION OF SIMPLE
            FIGURES; LIST THINGS THAT SHARE A GIVEN CHARACTERISTIC; INDUCE THE RULE
            IN A GROUP OF LETTER SETS; FIRST AND LAST NAMES; MAP MEMORY; NECESSARY
            ARITHMETIC OPERATIONS; SYLLOGISMS; VOCABULARY; SURFACE DEVELOPMENT;
            PLANNING PATTERNS; DIFFERENT USES OF COMMON OBJECTS

  CONTENT:  SPEED OF CLOSURE; ASSOCIATIONAL FLUENCY; EXPRESSIONAL FLUENCY; FIGURAL
            FLUENCY; IDEATIONAL FLUENCY; ASSOCIATIVE MEMORY; VISUAL MEMORY; GENERAL
            REASONING; LOGICAL REASONING; VERBAL COMPREHENSION; VISUALIZATION;
            FIGURAL FLEXIBILITY; FLEXIBILITY OF USE

(a) From Moran, Kimble, and Mefferd (1964) [remainder illegible]
(b) [partly illegible] not yet been considered for inclusion in PETER


TABLE 3: FLEISHMAN AND ELLISON (1962) MANUAL
DEXTERITY TESTS AND PETER

TESTS                                        CONTENT

CONSIDERED FOR PETER
AIMING (TAPPING SMALL CIRCLES)               AIMING, WRIST-FINGER SPEED
MINNESOTA RATE OF MANIPULATION: PLACING      MANUAL DEXTERITY
TURNING                                      MANUAL DEXTERITY

NOT YET CONSIDERED
MEDIUM TAPPING                               WRIST-FINGER SPEED
LARGE TAPPING                                WRIST-FINGER SPEED
PURSUIT AIMING: I, II                        AIMING
SQUARE MARKING                               UNIQUE
TRACING                                      UNIQUE
STEADINESS                                   UNIQUE
DISCRIMINATION REACTION TIME (PRINTED)       WRIST-FINGER SPEED, MANUAL DEXTERITY
PRECISION STEADINESS                         UNIQUE
TEN-TARGET AIMING: ERRORS, CORRECTS          SPEED OF ARM MOVEMENT
HAND PRECISION AIMING: ERRORS, CORRECTS      SPEED OF ARM MOVEMENT
PIN STICK                                    FINGER DEXTERITY
PURDUE PEGBOARD                              FINGER DEXTERITY
O'CONNOR FINGER DEXTERITY                    FINGER DEXTERITY

TABLE 4: ROSE (1974, 1978) INFORMATION PROCESSING
TESTS AND PETER

TEST                                CONTENT

CONSIDERED FOR PETER
LETTER ROTATION                     ROTATION
NEISSER SEARCH                      DECISION TIME
STERNBERG ITEM RECOGNITION          MEMORY SCANNING
LETTER RECALL (DIGIT SPAN)          ROTE MEMORY
MENTAL ADDITION                     TRANSFORMATION, STORAGE, RETRIEVAL
GRAMMATICAL REASONING               VERBAL ABILITY
SEMANTIC MEMORY                     ACCESS LONG TERM MEMORY
GRAPHEMIC & PHONEMIC ANALYSIS       ACCESS LONG TERM MEMORY
[illegible] CLASSIFICATION          STORAGE AND RETRIEVAL
LEXICAL DECISION MAKING             ACCESS LONG TERM MEMORY
FITTS TAPPING                       INFORMATION PROCESSING RATE
CRITICAL TRACKING                   CONTROL [illegible]
[illegible]                         RESPONSE [illegible]

TABLE 5: REITAN AND DAVISON (1974) TESTS AND PETER

TEST                                     CONTENT

CONSIDERED FOR PETER
TRAIL MAKING                             RAPID TAPPING IN A SPECIFIC PATTERN

NOT YET CONSIDERED
CATEGORY TEST                            VISUAL FIGURE IDENTIFICATION
TACTUAL PERFORMANCE TEST                 TACTILE FIGURE RECOGNITION AND ASSEMBLY
RHYTHM TEST                              COMPARISON OF RHYTHMIC SEQUENCES
SPEECH SOUNDS PERCEPTION TEST            DISCRIMINATE WORDS FROM ALTERNATIVES
FINGER OSCILLATION TEST                  SPEED OF FINGER TAPPING
CRITICAL FLICKER FREQUENCY               FUSION OF A FLASHING LIGHT
STEADINESS BATTERY                       COORDINATION AND TREMOR
LATERAL DOMINANCE EXAMINATION            HAND, FOOT, AND EYE DOMINANCE
WIDE RANGE ACHIEVEMENT TEST              READING, SPELLING, [illegible]
MINNESOTA MULTIPHASIC PERSONALITY        PERSONALITY
  INVENTORY
APHASIA SCREENING TEST                   VERBAL EXPRESSION
BALLISTIC ARM TAPPING                    LARGE ARM MOVEMENTS
ORIENTATION TEST                         RIGHT-LEFT ORIENTATION & IDENTIFICATION
DYNAMOMETER                              GRIP STRENGTH
SANDPAPER TEST                           EVALUATE TEXTURE
VISUAL SPACE ROTATION                    DRAW "X" WITH ROTATED VISION OF HAND
TACTILE FORM RECOGNITION TEST            TACTILE FORM RECOGNITION
PICTURE VOCABULARY TEST                  GIVE NAMES OF PICTURE OBJECTS

TABLE 6: DUKE UNIVERSITY ENVIRONMENTAL BATTERY
TESTS(a) AND PETER

TEST                                  CONTENT

CONSIDERED FOR PETER
ARITHMETIC                            NUMBER [illegible]
STROOP, COLOR, AND CONTROL            RESPONSE COMPETITION
BADDELEY'S GRAMMATICAL REASONING      [illegible]
DIGIT SPAN                            MEMORY [illegible]
NUMBER COMPARISON                     PERCEPT[illegible]

NOT YET CONSIDERED
BALL BEARING TEST                     [illegible]
PURDUE PEGBOARD                       FINGER DEXTERITY
BENNETT HAND TOOL DEXTERITY TEST      MANUAL DEXTERITY

(a) BENNETT [remainder illegible]

TABLE 7: UNDERWOOD, BORUCH, AND MALMI (1977)
TESTS OF MEMORY AND PETER

TEST                                        CONTENT

CONSIDERED FOR PETER
FREE RECALL-CONTROL                         FREE RECALL
FREE RECALL-CONCRETE WORDS                  FREE RECALL
FREE RECALL-ABSTRACT WORDS                  FREE RECALL
LIST DIFFERENTIATION                        FREE RECALL [partly illegible]
RUNNING RECOGNITION                         RECOGNITION
DIGIT SPAN                                  MEMORY SPAN
INTERFERENCE SUSCEPTIBILITY                 UNIQUE

NOT CONSIDERED FOR PETER
E.G., PAIRED ASSOCIATES, SERIAL LEARNING    PAIRED ASSOCIATES
E.G., VERBAL DISCRIMINATION                 DISCRIMINATION


TABLE 8: ATARI(R) GAMES AND PETER

TEST                            CONTENT

CONSIDERED FOR PETER
AIR COMBAT MANEUVERING (ACM)    COMPENSATORY TRACKING
SLALOM                          UNKNOWN
BREAKOUT                        SAME AS ACM
RACECAR                         SAME AS SLALOM
SURROUND                        ACM AND SLALOM
ICE RACE                        UNKNOWN
PONG                            UNKNOWN
BASKETBALL                      UNKNOWN
ANTI-AIRCRAFT                   UNKNOWN
FLAG CAPTURE                    UNKNOWN


TABLE 9: MISCELLANEOUS TESTS AND PETER

TEST                              CONTENT

CONSIDERED FOR PETER
NAVIGATION PLOTTING               MANEUVERING BOARD SOLUTIONS
LANDOLT C                         VISUAL ACUITY
TIME ESTIMATION                   CONTINUITY OF ATTENTION
MULTIPLE CHOICE REACTION TIME     REACTION TIME
DUAL CRITICAL TRACKING            TIME SHARING
COMPENSATORY TRACKING             TRACKING


PROCEEDINGS  OF  THE  24TH  ANNUAL  MEETING  OF  THE  HUMAN  FACTORS  SOCIETY 
LOS ANGELES, CA, 13-17 OCTOBER 1980




A  CATALOGUE  OF  PERFORMANCE  EVALUATION  TESTS  FOR  ENVIRONMENTAL  RESEARCH 

Robert  S.  Kennedy,  Robert  C.  Carter,  and  Alvah  C.  Bittner,  Jr. 

Naval Biodynamics Laboratory, New Orleans, LA 70189

ABSTRACT 

Performance Evaluation Tests for Environmental Research (PETER) are under development at the Naval
Biodynamics Laboratory and supporting organizations. The tests, or tasks, studied in this program have
been largely derived from the literature. Each task was evaluated for suitability for repeated-measures
experimental designs, which are almost universally used in environmental research. Suitability criteria
included the "stability" of task means, standard deviations, and between-trial correlations. The
magnitude of the "stabilized" between-trial correlations, task definition, was also examined with
respect to the administration time. There are 60 active tasks in the present program. All tasks examined
to date exhibit stable means and variances after adequate practice but: (a) less than 30% meet minimal
stability criteria for intertrial correlations; and (b) substantial practice (typically more than an
hour over five days) is required to achieve stability. A tabular catalogue of the research findings
and background for 15 tasks is presented and discussed.


INTRODUCTION 

Background 

An engineering approach to the development
and standardization of a battery of Performance
Evaluation Tests for Environmental Research (PETER)
is underway under the direction of the Naval
Biodynamics Laboratory. This approach involves
test and evaluation of performance tasks prior to
their being employed in the assessment of environmental
effects. The goal of this effort is to
ensure that selected tasks will be suitable for
simple analysis and interpretation when employed
in repeated-measures experiments (Kennedy & Bittner,
1977; Kennedy, Bittner & Harbeson, 1980). The
emphasis is on statistical requirements for
repeated-measures experimental designs because
environmental research usually includes measurement
of performance before, during, and after exposure
to an unusual environment.

The criteria for suitable stability of tests
used in repeated measures experiments have been
delineated by Jones (1980) and Kennedy et al.
(1980). These authors have suggested that "stability"
exists when: (a) group mean performance in
a standard environment has reached an asymptote or
evidences a slight constant slope, (b) day-to-day
between-subject variance is constant, and (c)
relative performance standings among subjects, as
indicated by intertrial correlations, are constant
from day to day. The importance of task stability
has not been fully recognized in the development
of previous batteries. Without stability, changes
of the means during a repeated-measures experiment
are not interpretable (Campbell & Stanley, 1963).

In addition, stability ensures that the assumptions
of repeated measures analysis of variance
are met (Winer, 1971). Further, stability verifies
the temporal generalizability (Cronbach, Gleser,
Nanda, & Rajaratnam, 1972) of subjects' scores.
Lastly, stability ensures that what-is-being-measured
does not change over time (Alvares &
Hulin, 1972; Jones, 1980). As defined by Jones
(1980), stability represents the properties which
must be met for statistically and scientifically
meaningful repeated-measures experiments.
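
The three stability criteria above are all read off day-by-day descriptive statistics. As a rough illustration only, the sketch below computes the daily group means, the daily between-subject standard deviations, and the matrix of intertrial (day-to-day) correlations from a subjects-by-days score table. The function names and the toy data are hypothetical, and no decision thresholds are implied; judging whether the values have levelled off is left to the analyst.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def stability_summary(scores):
    """Summarize a table where scores[s][d] is subject s's score on day d.

    Returns (daily means, daily sample S.D.s, day-by-day correlation matrix),
    i.e. the quantities inspected for criteria (a), (b), and (c).
    """
    days = [list(day) for day in zip(*scores)]  # one list of scores per day
    means = [sum(day) / len(day) for day in days]
    sds = [
        sqrt(sum((v - m) ** 2 for v in day) / (len(day) - 1))
        for day, m in zip(days, means)
    ]
    corr = [[pearson(a, b) for b in days] for a in days]
    return means, sds, corr


# Hypothetical data: 3 subjects tested on 3 days.
example_scores = [[10, 12, 13], [20, 22, 23], [30, 32, 33]]
example_means, example_sds, example_corr = stability_summary(example_scores)
print(example_means, example_sds)
```

In this toy table every subject improves by the same amount each day, so the means rise while the standard deviations and the rank order of subjects stay fixed.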


In addition to stability, a test should be
sensitive to environmental effects which are
reflected in changes of the mean score associated
with changes in treatment. Sensitivity to a
change of the mean, it is pertinent to note, is
enhanced by a large intertrial correlation (Winer,
1971). Figure 1 is a nomogram which shows the
relationship between intertrial correlation (R),
sample size (N), and the minimum statistically
significant difference (p ≤ .05) between standardized
scores from two trials of a repeated-measures
experiment. The nomogram is based on the equation
given by Winer (1971) for testing differences
between means of correlated observations. The
figure shows, for example, that if one sets out
to detect a mean change that exceeds .2 standard
deviations (one tailed test), and if 20 subjects
are available, then a task definition of .85 is
required. Furthermore, the same significance
level can be obtained for a mean difference of
.3 standard deviations when N = 5, R = .90; or
N = 33, R = .45; or N = 60, R = 0. This nomogram
emphasizes the importance of intertrial correlation
in the design of repeated measures experiments:
a little intertrial correlation saves a
lot of subjects.

Figure 1. Nomogram showing the minimum statistically
significant difference (p ≤ .05) between two
trials of a repeated-measures experiment with
sample size N and intertrial correlation R.
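
The relationship behind the nomogram can be approximated numerically. The sketch below is an assumption-laden illustration, not the authors' computation: it takes standardized scores (unit variance on each trial), so the standard error of the mean difference between two correlated trials is sqrt(2(1 - R)/N), and it uses a one-tailed normal critical value in place of the exact critical value from Winer (1971). The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def min_detectable_difference(n, r, alpha=0.05):
    """Smallest mean difference, in S.D. units, reaching one-tailed significance.

    n     -- number of subjects measured on both trials
    r     -- intertrial correlation (task definition)
    alpha -- one-tailed significance level

    Assumption: a large-sample normal critical value is used, which is only
    an approximation for small n.
    """
    if n < 2:
        raise ValueError("need at least two subjects")
    if not -1.0 <= r <= 1.0:
        raise ValueError("correlation must lie in [-1, 1]")
    z_crit = NormalDist().inv_cdf(1.0 - alpha)
    standard_error = sqrt(2.0 * (1.0 - r) / n)
    return z_crit * standard_error


# Cases quoted in the text.
print(round(min_detectable_difference(20, 0.85), 2))  # close to .2
print(round(min_detectable_difference(33, 0.45), 2))  # close to .3
print(round(min_detectable_difference(60, 0.00), 2))  # close to .3
```

Under these assumptions, raising R from 0 to .45 gives the same detectable difference with 33 subjects instead of 60.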


Purpose 

The primary purpose of this report is to present
a description of the stability and other
characteristics of 15 performance tasks which have
been investigated as part of the PETER Program. A
secondary purpose is to report progress on
45 additional tests which are being studied. The
goal of these presentations is to provide information
useful to other investigators engaged in
environmental research.


METHOD 

The approach employed is to summarize information
about candidate performance tasks in a tabular
format. Twenty of the most relevant task
characteristics were selected for presentation
under two broad categories: (a) Background
Information, and (b) Statistical Properties.
Background Information included the ten characteristics
defined in Table 1. Stability and
sensitivity are described by the ten properties
defined in Table 2.

TABLE 1: DEFINITIONS OF BACKGROUND INFORMATION
CHARACTERISTICS OF PERFORMANCE TASKS

     CHARACTERISTIC          DEFINITION

 1.  SOURCE REFERENCE        LITERATURE SOURCE DESCRIBING THE TASK
 2.  PETER REFERENCE         REPORT ON THE TASK SUBJECTED TO PETER INVESTIGATION
 3.  VALIDITIES              TYPES OF VALIDITIES TASK POSSESSES (CONTENT, CONSTRUCT,
                             PREDICTIVE, FACE)
 4.  VERIFICATION            CONTEXTS WHERE TASK HAS BEEN FOUND SENSITIVE
 5.  INDIVIDUAL/GROUP        TYPE OF ADMINISTRATION
 6.  TEST MODE               APPARATUS REQUIRED (E.G., PAPER & PENCIL, T.V.,
                             AUDIO VIEWER, TIMER)
 7.  TEST TIME IN SECONDS    TEST LENGTH IN SECONDS IN THE PETER EXPERIMENT
 8.  SCORE                   TYPE OF SCORE (I.E., HITS, % CORRECT, SLOPE,
                             NUMBER ATTEMPTED, LATENCY)
 9.  N                       SAMPLE SIZE FOR WHICH DATA ARE AVAILABLE
10.  COMMENTS                CHARACTERISTICS WHICH DID NOT FALL CONVENIENTLY
                             INTO OTHER CATEGORIES


TABLE 2: DEFINITIONS OF STATISTICAL PROPERTIES
OF PERFORMANCE TASKS

     PROPERTY                    DEFINITION

 1.  DAY X̄ STABILIZES            DAY AT WHICH MEAN REACHES STABILITY
 2.  X̄                           VALUE OF MEAN AT DAY STABILITY IS REACHED
 3.  b_m                         VALUE OF SLOPE OF SCORES DURING STABLE PERIOD
 4.  DAY S.D. STABILIZES         DAY AT WHICH STANDARD DEVIATIONS BECOME STABLE
 5.  S.D.                        VALUE OF STABLE S.D.
 6.  DAY R STABILIZES            DAY AT WHICH INTERTRIAL CORRELATION (R) STABILIZES
 7.  TASK DEFINITION             VALUE OF R DURING STABLE PERIOD
 8.  STANDARDIZED RELIABILITY    CALCULATED BY USING THE SPEARMAN-BROWN FORMULA
                                 USING A THREE MINUTE BASE (C.F., FIGURE 2)
 9.  OVERALL STABILITY           DAY AT WHICH ALL FORMS OF STABILITY ARE PRESENT
10.  SENSITIVITY                 DEGREE TO WHICH STANDARDIZED RELIABILITY
                                 EXCEEDS r = .707.

Most of the characteristics
and properties listed in Tables 1 and 2 are easily
understood. However, the "standardized reliability"
(Table 2, Item 8) may require an explanation.
It is the value, estimated by the Spearman-Brown
formula (c.f. Winer, 1971), that the intertrial
correlation would have had if the test had lasted
three minutes. Standardized reliability is useful
for comparing reliabilities of tests with different
administration times. If such a comparison
were made without regard to test administration
time, then a test with a longer administration
time would tend to be favored because reliability
increases with test length. Figure 2 shows the
tradeoff of test time and reliability, according
to the Spearman-Brown formula. Standardized
reliability allows comparisons of reliabilities of
tests for any arbitrary (in this case, 3 minute)
administration period.
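
The standardization described above can be illustrated with the usual form of the Spearman-Brown prophecy formula, r_k = k r / (1 + (k - 1) r), where k is the ratio of the new test length to the old one. The sketch below is a minimal example under that assumption; the function name and the 180-second default are illustrative choices, with the three-minute base taken from the text.

```python
def standardized_reliability(r, test_seconds, base_seconds=180.0):
    """Project an intertrial correlation r to a test of base_seconds length.

    r            -- observed intertrial correlation (task definition)
    test_seconds -- administration time of the test as actually given
    base_seconds -- common reference length (three minutes in the text)
    """
    if test_seconds <= 0 or base_seconds <= 0:
        raise ValueError("test lengths must be positive")
    k = base_seconds / test_seconds  # lengthening (k > 1) or shortening (k < 1)
    return k * r / (1.0 + (k - 1.0) * r)


# Two rows as read from Table 3 of this paper:
# Grammatical Reasoning: 60 s test, task definition .82, listed value .93
print(round(standardized_reliability(0.82, 60), 2))
# Complex Counting: 900 s test, task definition .85, listed value .53
print(round(standardized_reliability(0.85, 900), 2))
```

A short test with a moderate correlation can therefore rank above a long test with a higher raw correlation once both are projected to the same length.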


[Figure 2 axis: RELIABILITY, 0.0 to 1.0]

Figure 2. Tradeoff between intertrial correlation
and test time.




RESULTS  AND  DISCUSSION 

Fifteen completed appraisals of Performance
Evaluation Tests for Environmental Research are
summarized in Table 3. Some of the tests provide
multiple scores. For example, the item recognition
test yields a reaction time, slope and intercept.
Because each score has its own properties
and interpretations, the scores are represented by
separate rows of Table 3. The first 10 columns of
Table 3 list general characteristics (defined in
Table 1) for each score. The remaining columns of
Table 3 summarize the statistical results of the
test assessments (defined in Table 2). Note that
each score's mean stabilizes eventually (reaches
constant slope). The mean (X̄) at the day stability
was attained, and the slope (b_m) that prevailed
thereafter are listed in order that the mean
on any particular stable day can be calculated. In
contrast to the means, which usually required several
sessions to stabilize, the standard deviations
(S.D.) stabilized rapidly, usually during the first
or second day of testing. At the other extreme,
some of the intertrial correlation matrices never
stabilized. Instead they exhibited superdiagonal
form (Alvares & Hulin, 1972) throughout the 15 days
of testing. However, most tests do provide stable
intertrial correlations after several sessions of
testing. Only a few of these tests have a creditable
task definition. If it is required that the
test predict at least 50% of its own variance in
later sessions, then task definition would have to
be in excess of .7. The extent to which this sensitivity
criterion was met by each test is shown in
the final column of Table 3. The penultimate column
lists the days on which each test has stable means,
S.D., and intertrial correlations. Considering both
stability and sensitivity, six of the tests in Table
3 are recommended for inclusion in test batteries
for environmental research using repeated measures:
(1) Grammatical Reasoning, (2) Stroop, (3) Air Combat
Maneuvering, (4) Code Substitution, (5) Arithmetic,
and (6) Tapping.
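
Two small calculations are implied by this paragraph, and the sketch below illustrates both under stated assumptions. First, the mean on a later stable day is taken to follow a straight line from the tabled mean with the tabled slope per day. Second, the .7 criterion follows from requiring r squared of at least .50 (r of about .707), and the sensitivity symbols are assigned from the cut points in note h of Table 3 (read here as ++, +, -, --). How values falling exactly on a cut point are coded is an assumption; all function names are hypothetical.

```python
from math import sqrt


def projected_mean(mean_at_stability, slope_per_day, day_stabilized, day):
    """Mean expected on a stable day, assuming a constant slope per day."""
    if day < day_stabilized:
        raise ValueError("projection only applies once the mean is stable")
    return mean_at_stability + slope_per_day * (day - day_stabilized)


def min_r_for_shared_variance(proportion):
    """Correlation needed for a test to predict this share of its own variance."""
    return sqrt(proportion)


def sensitivity_code(r):
    """Map a standardized reliability to the Table 3 sensitivity symbol.

    Assumption: boundary values (.8, .7, .35) are coded as shown below.
    """
    if r > 0.8:
        return "++"
    if r >= 0.7:
        return "+"
    if r >= 0.35:
        return "-"
    return "--"


print(round(min_r_for_shared_variance(0.50), 3))
print(sensitivity_code(0.93), sensitivity_code(0.65), sensitivity_code(0.22))
print(projected_mean(10, 0.37, 4, 6))
```

The last line uses the Grammatical Reasoning values as read from Table 3 (mean 10 at day 4, slope .37) to project the mean two days later.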

Forty five additional tests are equally distributed
among the three stages of appraisal:
planning, data gathering, and analysis. More
tests will be added to the program later. When
the program was begun, it was assumed that 150 to
200 tests would be assessed to provide enough
stable, sensitive tests to characterize human
performance. Now, it is suspected that 100 tasks
may suffice because several studies of tests
representing presumably orthogonal factors have
shown convergence (increased correlation) between
the tests with extended practice (Kennedy, Bittner,
& Jones, 1980; Jones, Kennedy, & Bittner, 1980;
McCafferty, Bittner, & Carter, 1980).

It is anticipated that this is the first of
many catalogues of Performance Evaluation Tests
for Environmental Research. The tabular form of
the catalogue is intended to provide useful information
to environmental researchers in a succinct
form. For instance, one may estimate the amount of
distributed practice required for stability by
multiplying "Administration Time" by "Day X̄
Stabilizes". Furthermore, the catalogue provides
information which may be used in conjunction with
Figures 1 and 2 to plan sample size, testing
time, and minimum detectable effects for repeated-
measures experiments. However, the format of
the catalogue is tentative. The authors encourage
suggestions for a revised format to be used in
future catalogues.
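
The practice estimate mentioned above is a single multiplication. The sketch below shows it under the assumption of one administration per day, as in the PETER experiments summarized here; the function name is hypothetical and the example values are those read from the Code Substitution row of Table 3 (240 s per administration, mean stable at day 8).

```python
def practice_seconds(test_seconds, day_mean_stabilizes):
    """Total distributed practice needed before the mean stabilizes.

    Assumes one administration of test_seconds length on each day,
    up to and including the day the mean stabilizes.
    """
    if test_seconds <= 0 or day_mean_stabilizes < 1:
        raise ValueError("need a positive test time and a day of 1 or later")
    return test_seconds * day_mean_stabilizes


total = practice_seconds(240, 8)
print(total, "seconds =", total / 60, "minutes")
```

The same product can be formed with the overall stability day when stable standard deviations and intertrial correlations are also required.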

REFERENCES 

1.  Alvares, K. M., & Hulin, C. L. Two explanations of temporal changes in ability-
    skill relationships: A literature review and theoretical analysis.
    Human Factors, 1972, 14, 295-308.

2.  Baddeley, A. D. A 3 min reasoning test based on grammatical transformation.
    Psychonomic Science, 1968, 10, 341-342.

3.  Campbell, D. T., & Stanley, J. C. Experimental and quasi-experimental designs
    for research. Chicago: Rand McNally, 1966.

4.  Carter, R. C., Kennedy, R. S., & Bittner, Jr., A. C. Grammatical reasoning: A stable
    performance yardstick. Unpublished manuscript, 1980.

5.  Carter, R. C., Kennedy, R. S., Bittner, Jr., A. C., & Krause, M. Item recognition
    as a performance evaluation test for environmental research. Proceedings of
    the 24th Annual Meeting of the Human Factors Society, 1980.

6.  Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. The dependability of
    behavioral measurements. New York: John Wiley, 1972.

7.  Damos, D. L., Kennedy, R. S., Bittner, Jr., A. C., & Harbeson, M. M. Effects of
    extended practice on dual-task training. Paper presented at the 87th Annual
    Convention of the American Psychological Association, 1979.

8.  Damos, D. L., Kennedy, R. S., & Bittner, Jr., A. C. Development of a performance
    evaluation test for environmental research (PETER): Critical tracking test.
    Proceedings of the 50th Annual Scientific Meeting of the Aerospace Medical
    Association, 1979, 33-34.

9.  Kennedy, R. S., Bittner, Jr., A. C., & Jones, M. B. Exploratory studies of tracking
    tasks. Unpublished manuscript, 1980.

10. Harbeson, M. M., Kennedy, R. S., & Bittner, Jr., A. C. A comparison of the Stroop
    test to other tasks for studies of environmental stress. Proceedings of the 12th
    Annual Meeting of the Human Factors Association of Canada, 1979, [pages illegible].

11. Jex, H. R., McDonnell, J. D., & Phatak, A. V. A "critical" tracking task for manual
    control research. IEEE Transactions on Human Factors in Electronics, 1966, HFE-7,
    138-145.

12. Jones, M. B. Stabilization and task definition in a performance test battery.
    (NBDL Monograph No. M-0001) New Orleans, LA: Naval Biodynamics Laboratory, 1980.




13. Jones, M. B., Kennedy, R. S., & Bittner, Jr., A. C. Video games and convergence or
    divergence with practice. Proceedings of the Seventh Psychology in the DOD
    Symposium, USAF Academy, Colorado Springs, CO, 16-18 April 1980.

14. Kennedy, R. S., & Bittner, Jr., A. C. The development of a Navy Performance
    Evaluation Test for Environmental Research (PETER). In L. T. Pope & D. Meister
    (Eds.), Productivity Enhancement: Personnel Performance Assessment in Navy
    Systems. Symposium presented at the Naval Personnel R & D Center, San Diego,
    October 1977, 393-408. (NTIS No. AD A045047)

15. Kennedy, R. S., & Bittner, Jr., A. C. Development of performance evaluation tests
    for environmental research (PETER): complex counting. Aviation, Space, and
    Environmental Medicine, 1980, 51, 142-144.

16. Kennedy, R. S., & Bittner, Jr., A. C. The utility of commercially available
    television-computer games for assessing performance and other applications.
    Proceedings of the 51st Annual Scientific Meeting of the Aerospace Medical
    Association, 1980.

17. Kennedy, R. S., Bittner, Jr., A. C., & Einbender, S. W. Development of performance
    evaluation tests for environmental research (PETER): trail making test.
    Unpublished manuscript, 1980.

18. Kennedy, R. S., Bittner, Jr., A. C., & Harbeson, M. M. An engineering approach
    to the standardization of performance evaluation tests for environmental
    research (PETER). Proceedings of the 11th Annual Conference of the Environmental
    Design Research Association (EDRA), Charleston, SC, 2-6 March, 1980.

19. Krause, M., & Kennedy, R. S. Performance evaluation tests for environmental
    research (PETER): Interference susceptibility test (IST). Proceedings of
    the 7th Psychology in the DOD Symposium, USAF Academy, Colorado Springs, CO,
    1980.

20. McCafferty, D. B., Bittner, Jr., A. C., & Carter, R. C. Performance evaluation
    tests for environmental research (PETER): Auditory digit span.
    Proceedings of the 24th Annual Meeting of the Human Factors Society, 1980.

21. McCauley, M. E., Kennedy, R. S., & Bittner, Jr., A. C. Development of performance
    evaluation tests for environmental research (PETER): time estimation test.
    Proceedings of the 23rd Annual Meeting of the Human Factors Society, Boston, MA,
    513-517, October 1979.

22. Pepper, R. L., Kennedy, R. S., Bittner, Jr., A. C., & Wiker, S. F. Performance
    evaluation tests for environmental research (PETER): Code substitution test.
    Proceedings of the 7th Psychology in the DOD Symposium, USAF Academy, 1980.

23. Reitan, R. M., & Davison, L. A. Clinical neuropsychology: Current status and
    applications. New York: John Wiley, 1974.

24. Seales, D. M., Kennedy, R. S., & Bittner, Jr., A. C. Development of performance
    evaluation tests for environmental research (PETER): arithmetic computation.
    Proceedings of the 23rd Annual Meeting of the Human Factors Society, 1979.

25. Sternberg, S. High speed scanning in human memory. Science, 1966, 153, 652-654.

26. Stroop, J. R. Studies of interference in serial verbal reactions. Journal of
    Experimental Psychology, 1935, 18, 643-662.

27. Underwood, B. J., Boruch, R. F., & Malmi, R. A. The composition of episodic
    memory. (ONR Contract No. N00014-76-C-0270) (NTIS No. AD A040696).

28. Wechsler, D. Measurement and appraisal of adult intelligence. Baltimore, MD: The
    Williams & Wilkins Co., 1939.

29. Zelkind, I., & Sprug, J. Time research: 1,172 studies. Metuchen, NJ: Scarecrow
    Press, 1974.

30. Winer, B. J. Statistical principles in experimental design (2nd ed.). New
    York: McGraw-Hill, 1971.




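TABLE 3: PRELIMINARY CATALOGUE OF 15 PERFORMANCE TASKS STUDIED BY THE PETER PARADIGM

COLUMNS (LEFT TO RIGHT): REFERENCE(a), SOURCE | REFERENCE(a), PETER | VALIDITIES(b) |
VERIFICATION(c) | INDIV./GROUP | TEST MODE(d) | TEST TIME (SEC) | SCORE(e) | N |
COMMENTS(f) | DAY X̄ STABILIZES | X̄ | b_m | DAY S.D. STABILIZES | S.D. |
DAY R STABILIZES | TASK DEF'N. | STAN. RELIABILITY(g) | DAY STABLE | SENSITIVITY(h)

Entries for each task are listed in column order.

GRAMMATICAL REASONING:        2 | 4 | 2,4 | 2,3 | G | 1 | 60 | NC | 23 | 4 | 4 | 10 | .37 | 1 | 5 | 5 | .82 | .93 | 5 | ++
ITEM RECOGNITION (a) SLOPE:   25 | 5 | 2 | 1,3 | I | 3 | 900 | S | 21 | 2 | 2 | 45 | 0 | 2 | 60 | - | - | .00 | - | --
ITEM RECOGNITION (b) RT:      25 | 5 | 2 | 1,3 | I | 3 | 225 | RT | 21 | 2 | 4 | 700 | 0 | 2 | 20 | 3 | .70 | .65 | 4 | -
COMPLEX COUNTING:             15 | 15 | 4 | 3 | G | 4 | 900 | PC | 19 | 2,3 | 1 | 80 | 0 | 1 | 15 | 3 | .85 | .53 | 3 | -
STROOP (a) BW WORDS:          26 | 10 | 2 | 1,2 | 3 | G | 5 | 30 | NA | 19 | 3 | 7 | 53 | 0 | 1 | 8 | 4 | .83 | .92 | 7 | ++
STROOP (b) COLOR BLOCKS:      26 | 10 | 2 | 1,2 | 3 | G | 5 | 30 | NA | 19 | 3 | 7 | 47 | 0 | 1 | 9 | 2 | .83 | .92 | 7 | ++
STROOP (c) COLOR WORDS:       26 | 10 | 2 | 1,2 | 3 | G | 5 | 30 | NA | 19 | 3 | 7 | 40 | 0 | 1 | 10 | 4 | .80 | .90 | 7 | ++
AIR COMBAT MANEUVERING:       16 | 16 | 4 | 4 | I | 2 | 1380 | NC | 15 | 2,5 | 4 | 13 | .5 | 1 | 3 | 6 | .93 | .70 | 6 | +
SLALOM:                       16 | 16 | 4 | 4 | I | 2 | 952 | NC | 15 | 2,5 | 4 | 79 | 2 | 1 | 13 | 7 | .60 | .22 | 7 | --
DIGIT SPAN (a) FWD:           28 | 20 | 1-4 | 1-4 | G | 6 | 1800 | NC | 9 | 3 | 20 | .67 | 3 | 5 | 11 | .68 | .20 | 11 | --
DIGIT SPAN (b) BKWD:          28 | 20 | 1-4 | 1-4 | G | 6 | 1800 | NC | 9 | 3 | 17 | .67 | 3 | 8 | - | - | .30 | - | --
CODE SUBSTITUTION:            28 | 22 | 1,2 | 1-4 | G | 1 | 240 | NC | 19 | 4 | 8 | 70 | .8 | 8 | 10 | 8 | .75 | .70 | 8 | +
ARITHMETIC:                   28 | 24 | 1,3 | 4 | 1-4 | G | 1 | 600 | NC | 18 | 4 | 4 | 36 | 0 | 1 | 18 | 1 | .94 | .85 | 4 | ++
TIME ESTIMATION:              29 | 21 | 1,4 | 3 | I | 7 | 600 | CE | 19 | 5 | 1 | 8 | 0 | 1 | 3 | - | - | .75 | - | +
CRITICAL TRACKING:            11 | 8 | 4 | 3 | I | 4 | 900 | 1 | RT | 18 | 1,2 | 5 | 4 | 5.5 | .12 | 1 | .7 | 10 | .85 | .65 | 10 | -
DUAL CRITICAL TRACKING:       11 | 7 | 4 | 2 | I | 4 | 900 | 1 | RT | 12 | 1,2 | 5 | 5 | 4.4 | .07 | 4 | .8 | 10 | .76 | .40 | 10 | -
INTERFERENCE SUSCEPTIBILITY:  27 | 19 | 2 | 4 | I | 3 | 600 | PC | 23 | 3 | 3 | 60 | 1.1 | 1 | 20 | 8 | .71 | .45 | 8 | -
TRAIL MAKING:                 23 | 17 | 1,2 | 2 | G | 1 | 110 | CT | 18 | 5 | 110 | 0 | 2 | 22 | 2 | .40 | .50 | 5 | -
TAPPING:                      23 | 17 | 1-4 | 1,2 | G | 1 | 36 | CT | 18 | 1 | 36 | 0 | 1 | 5 | 2 | .85 | .95 | 4 | --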

NOTES:

a. See References

b. Validities: 1-Content, 2-Construct, 3-Predictive, 4-Face

c. Verification: 1-Brain Damage, 2-Human Information Processing,
   3-Environmental Change, 4-Factor Identified by Factor Analysis

d. Test Mode: 1-Paper and Pencil, 2-T.V. Game, 3-Audioviewer,
   4-Specialized Equipment, 5-Slides, 6-Verbal, 7-Stopwatch

e. Score: NC-Number Correct, S-Slope, RT-Reaction Time, PC-Percent Correct,
   NA-Number Attempted, CE-Algebraic sum of timing errors, CT-Completion Time

f. Comments: 1-Not Portable, 2-Possible electrical hazard in some
   environments, 3-Limited number of forms available, 4-Computer programs
   available to generate forms, 5-Self scoring

g. Standardized Reliability: Estimate of what the reliability would have
   been if the test had lasted 3 minutes. Computed using the
   Spearman-Brown Formula (Winer, 1971).

h. Sensitivity: ++, r>.8; +, .8>r>.7; -, .7>r>.35; --, r<.35
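
The standardized reliability in note g can be illustrated with a short sketch. It applies the usual Spearman-Brown prophecy formula, taking the lengthening factor k as the ratio of the 3-minute target to the actual test duration. The function names and the example values are illustrative assumptions, not taken from the report.

```python
def spearman_brown(r, k):
    """Predicted reliability when a test is lengthened by a factor k.

    r is the observed reliability; k > 1 lengthens, k < 1 shortens.
    """
    return k * r / (1 + (k - 1) * r)


def standardize_to_minutes(r, actual_minutes, target_minutes=3.0):
    """Estimate the reliability a test would have at the target duration.

    Assumes k = target_minutes / actual_minutes (an interpretation of
    note g, not a procedure spelled out in the report).
    """
    k = target_minutes / actual_minutes
    return spearman_brown(r, k)


if __name__ == "__main__":
    # Hypothetical test: 1.5 minutes long, observed reliability .60
    print(round(standardize_to_minutes(0.60, 1.5), 2))
```

A test that already lasts 3 minutes is left unchanged (k = 1), which is a quick sanity check on the formula.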




PROCEEDINGS  OF  THE  SEVENTH  PSYCHOLOGY  IN  THE  DOD  SYMPOSIUM 
USAF  ACADEMY,  COLORADO  SPRINGS,  CO  16-18  APRIL  1980 


Performance  Evaluation  Tests  for  Environmental  Research  (PETER): 

Code  Substitution  Test 

Ross  L.  Pepper 
Naval  Ocean  Systems  Center 

Kailua,  Hawaii 

Robert S. Kennedy and Alvah C. Bittner, Jr.

Naval  Aerospace  Medical  Research  Laboratory  Detachment 
New  Orleans,  Louisiana  70189 

Steven  F.  Wiker 

Department  of  Industrial  Engineering 
University  of  Michigan,  Ann  Arbor  49107 

Abstract 

A Code Substitution Test was considered for inclusion in the Performance
Evaluation Tests for Environmental Research (PETER) battery. The
effects of repeated testing on code substitution performance were studied
to determine the reliability and stability of task performance. A single
two-minute testing trial per day was administered to a group of 19 subjects
for 15 consecutive weekdays. In a second experiment, a four-minute-per-day
test was administered to 12 of the 19 original subjects for an additional
15 consecutive weekdays. Descriptive statistics are reported.
Comparisons are made between these laboratory data and performances
assessed at sea with repeated administration occurring within each day.
The need for knowledge about task stability over repeated performance
testing in exotic environments is discussed. The Code Substitution Test
is recommended for inclusion in the PETER battery.


A research program is underway to evaluate tests of mental work for
future use in studying adverse environments (Kennedy & Bittner, 1977).
Each test is examined for stability as it is performed over periods of
extended practice (15 days). Tests found to be suitably stable and to
possess other characteristics (Kennedy, Bittner & Harbeson, 1980) are
made part of a battery of Performance Evaluation Tests for Environmental
Research (PETER). The present study reports the findings for a form of
the Code Substitution (or Digit-Symbol) Test.

Otis is generally given credit for the initial development of a
Digit-Symbol Test and, with Terman, for the evolution of group intelligence
testing around World War I (Wechsler, 1958). Wechsler (1958) included
the Digit-Symbol Test in the original Wechsler-Bellevue (W-B) IQ Test.
He felt this inclusion was required because it was one of the oldest and
best established of all psychological tests. He felt that the Digit-Symbol
Test measured both speed and power, and that both should be given
weight in the evaluation of intelligence. He reported high correlations


The opinions are those of the authors and do not necessarily reflect
those of the Department of the Navy.

This research was performed under Navy Work Unit No. MF58.524.002-5027.


between Digit-Symbol Test scores and total IQ scores (r = .673 for ages
20-34; r = .697 for ages 35-49; see Wechsler, 1939, p. 136). In
describing the standardization of his test, Wechsler reported split-half
coefficients ranging from r = .83 to r = .90 after correction for
attenuation. However, it should be noted that his standardization procedure
was not a conclusive demonstration of reliability, stability, or
validity. Correlations within and between the verbal and performance
sub-tests indicated the measurement of common variation, which could
reflect a common cluster of factors, correlated errors of measurement
within days, or both. Hence, the consistency of the Digit-Symbol Test is
not clear.

In addressing this issue, Derner, Aborn and Cantor (1950) rightly
pointed out that the method of choice for determining the reliability of
a measuring instrument is a test-retest technique. They then conducted a
test-retest study to assess changes over 6 months, 4 weeks, and 1 week
using normal adults (n=158). In all sub-tests, including Digit-Symbol, a
learning effect was apparent. The overall WAIS reliability coefficients
across test-retest intervals varied from r=.83 to r=.88 for the performance
scale, and the Digit-Symbol coefficient was r=.80. This was the first
substantial evidence that the Code Substitution Test has sufficient
reliability to potentially reflect changes with environmental
manipulations. It is noteworthy that, except for the schizophrenic
population employed by Ragin, all adult reliabilities on the Digit-Symbol
test surveyed by Derner, et al. (1950) exceeded the mid .70's. Hence, the
body of literature suggests that the Digit-Symbol test has adequate simple
test-retest reliability.

The stability of the Digit-Symbol test alone across extensive repeated
testing or practice has not been sufficiently established in previous
research. The most relevant study was by Woodrow (1937), who compared
the performance of high and low initial score performers on a
variety of tests, including a Code Substitution Test. Testing was
conducted daily for a 10 minute period for 39 days for one group (n = 56)
and for 66 days for a second group (n = 82). The initial-final
reliability coefficients for code substitution were r = .57 for the former
and r = .59 for the latter. The ratios of initial and final group standard
deviations were 1.57 and 1.64, respectively, for the two groups, indicating
that between-subject differences increased slightly with practice, a finding
that has been obtained elsewhere (Harbeson, Kennedy, & Bittner, 1979).

The extent to which performance on a variety of tasks confounds findings
is not known. Therefore, the primary purpose of the present effort was
to study code substitution in the laboratory under baseline conditions
over extended practice. A secondary purpose was to report the
sensitivity of this test in a field study.

Experiment  1 

Method 

Subjects. Navy enlisted men (n=19) aged 19-24 comprised the experimental
group. These men were recruited, evaluated and employed in accordance
with procedures described elsewhere (Thomas, Majewski, Ewing &


Pristo  (1978)  has  shown  lower  test-retest  reliabilities  (r  =  .20)  in  40 
children  (IQ  range  52-145)  than  expected. 




Gilbert,  1978).  These  procedures  meet  or  exceed  prevailing  national  and 
international  guidelines  concerning  human  use  in  research.  The  subjects 
received  extra  compensation  for  volunteering  and  appeared  motivated  to 
perform.  They  were  representative  of  the  Navy  population  in  size  and 
intelligence  but  physically  and  mentally  screened  for  hazardous  duty 
environment  research.  They  were  under  continuous  medical  supervision. 

Apparatus. The Code Substitution test forms were derived from the
concepts of Otis: each day, nine letters were each randomly assigned a
digit from one to nine. Fifteen alternate forms were computer generated
following a general Monte Carlo algorithm: (a) the digit-letter
relationships were changed daily; (b) each letter appeared 10-15 times in a
daily list of 135 items; and (c) each letter was nonrepeating. Figure 1
shows the layout of a sample test form.
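
The form-generation constraints can be sketched in code. This is a minimal illustration, not the original program. It reads "nonrepeating" as "no letter appears twice in a row", and it gives every letter exactly 15 occurrences (9 x 15 = 135), which is one way of satisfying the 10-15 range. The letter set, function name, and retry limit are hypothetical.

```python
import random


def make_form(letters="BFHJLMSXZ", per_letter=15, seed=None, max_tries=1000):
    """Return (key, items) for one day's form.

    key   -- dict mapping each letter to a distinct digit 1..len(letters)
    items -- list of letters with no letter immediately repeated
    """
    rng = random.Random(seed)

    # (a) a fresh random digit-letter assignment for the day
    digits = list(range(1, len(letters) + 1))
    rng.shuffle(digits)
    key = dict(zip(letters, digits))

    total = per_letter * len(letters)
    for _ in range(max_tries):
        remaining = {letter: per_letter for letter in letters}
        items, prev = [], None
        while len(items) < total:
            # (c) never pick the letter that was just used
            candidates = [l for l in letters if remaining[l] > 0 and l != prev]
            if not candidates:
                break  # dead end: only the previous letter is left; retry
            weights = [remaining[l] for l in candidates]
            prev = rng.choices(candidates, weights=weights)[0]
            remaining[prev] -= 1
            items.append(prev)
        else:
            return key, items  # (b) every letter used per_letter times
    raise RuntimeError("could not build a form within max_tries")


if __name__ == "__main__":
    key, items = make_form(seed=1)
    print(key)
    print("".join(items[:15]))  # first row of the form
```

Weighting the draw by the remaining counts keeps the letters roughly balanced as the list fills, so dead ends near the end of the list are uncommon and the retry loop rarely runs more than a few times.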

Procedure. The subject's task was to follow the letter/number
correspondence for a given day in assigning the appropriate letter below
each number. Subjects were instructed to proceed rapidly and accurately
throughout the list until told to stop. Each session in Experiment 1
lasted two minutes. The subjects were ordinarily tested in a group each
workday morning for three weeks. Performance was scored according to
number attempted, number correct, and rights minus wrongs. Group means,
between-subject standard deviations, and cross session reliabilities were
calculated for each score. Analysis of variance (ANOVA) was conducted for
days and subjects main effects.

Results 

Only results for total-correct are reported here, as the subjects
made very few errors (on average, one per subject per day) and other
scores (e.g., total attempted) were redundant. Figure 2 shows means and
standard deviations for total-correct for nineteen subjects over 15 days.
Mean performance is seen to improve throughout the study, although the
trend becomes less pronounced after Day 8. Similarly, standard
deviations increase but are relatively constant after Day 8. Figure 3 shows
the cross session reliabilities for selected base days, the source of
which is Table 1. Correlation traces (Bittner, 1979) show negative
slopes for Base Days 1, 2, and 4. This trend is less evident in traces
for Base Days 8, 10, and 12, suggestive of differential stabilization
somewhere between Days 4 and 8. Task definition (Jones, 1980), the
degree to which a test differentiates reliably between individuals, is
greater than r = .75 subsequent to Day 8.
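
A correlation trace of the kind described above can be computed as follows. This is a rough sketch under stated assumptions: scores are held as one list per subject with one entry per day, the trace for a base day is the Pearson correlation between that day and each later day, and the slope is an ordinary least-squares fit of r on day. The function names and the slope summary are illustrative, and the tests of Bittner (1979) are not reproduced here.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation of two equal-length score lists.

    Raises ZeroDivisionError if either list has zero variance.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def correlation_trace(scores, base_day):
    """Correlate the base day with every later day (days are 1-indexed).

    scores[s][d - 1] is subject s's score on day d.
    Returns a list of (day, r) pairs.
    """
    n_days = len(scores[0])
    base = [row[base_day - 1] for row in scores]
    return [
        (day, pearson_r(base, [row[day - 1] for row in scores]))
        for day in range(base_day + 1, n_days + 1)
    ]


def trace_slope(trace):
    """Least-squares slope of r on day for one trace."""
    days = [d for d, _ in trace]
    rs = [r for _, r in trace]
    md, mr = sum(days) / len(days), sum(rs) / len(rs)
    num = sum((d - md) * (r - mr) for d, r in trace)
    den = sum((d - md) ** 2 for d in days)
    return num / den
```

A clearly negative slope for an early base day, with slopes near zero for later base days, corresponds to the pattern reported here for Base Days 1, 2, and 4 versus 8, 10, and 12.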

Experiment  2 

Method 

Subjects. Twelve of the 19 original subjects comprised the experimental
group. Between the end of Experiment 1 and the beginning of
Experiment 2, the other 7 subjects were transferred and were not
available for testing.

Apparatus. The test forms were produced in the same way as in
Experiment 1, with the exception that each day's test was twice as long
(270 vs. 135 items).

Procedure. The procedure was the same as in Experiment 1, except that
the subjects were given 4 minutes rather than 2 minutes of testing each
day. The testing period began 11 weeks after the conclusion of the first
experiment, and continued for 15 consecutive workdays.


Results 

Experiment 2 was conducted in an attempt to improve the magnitude
of the correlations by doubling testing time. Although only twelve
of the original 19 subjects remained available for the retest, their
means (Figure 5) were not statistically different from those of the
original group (p > .5). The second study also continued for fifteen days,
and the means and standard deviations for these twelve subjects appear in
Figure 6. Performance continued to improve over the period of the
experiment; the change is slight but significant (p < .01). Not unexpectedly,
the values are about twice those of the shorter test (cf. Figure 3).
Correlations are level for all comparisons, indicating that task
stabilization was manifested on Day 1 of Experiment 2. Task definition is
better than with the shorter test but slightly less than predicted by a
Spearman-Brown adjustment.
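
The size of the predicted gain can be worked through by hand. The sketch below assumes the 2-minute reliability is taken as r = .75 (the Experiment 1 results state only that task definition exceeds this value after Day 8) and writes r_2 for the predicted reliability of a test twice as long. The notation is assumed, not the authors'.

```latex
r_2 = \frac{2r}{1 + r} = \frac{2(.75)}{1 + .75} = \frac{1.50}{1.75} \approx .86
```

On this reading, a doubled test would be expected to reach at least about .86, which is the benchmark that the observed 4-minute correlations fall slightly short of.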

Experiment  3 

Method 

Subjects. Six U. S. Coastguardsmen were selected from the complement
of the WPB 95 (White Patrol Boat) employed in this study.

Apparatus and Procedure. Testing materials and procedures were
similar to those employed in Experiments 1 and 2 with the following
exceptions: Testing was conducted hourly from 0800-1600 for four
consecutive days. The testing compartment was located amidships, below
decks. The first two days of testing were conducted dockside, with engines
running. The second two days of testing occurred while the vessel steamed
a double octagonal pattern seven miles southwest of Honolulu in the
Molokai Channel, an area acknowledged for its turbulent sea condition.
The testing commenced each sea day while the vessel steamed directly into
the primary swell. Course changes of 45° were made every half hour
throughout the day, creating a systematically changing motion
environment. (See Wiker & Pepper, 1978 for greater details of the testing
conditions and a description of other task and subject variables assessed
during this phase).

Results 

Figure 7 shows performance on the Code Substitution Test for the six
Coastguardsmen exposed to mild seas in the Molokai Channel. The data are
plotted as scores per minute for the 16 dockside practice trials versus
the 16 at sea data points. For comparability, the data from the first
and second laboratory studies (Figures 4 and 5) have been replotted as a
function of cumulative practice. Plotted in this way, 15 days of 2
minute laboratory trials can be compared to the first 15 hours of
dockside testing.
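
The replotting step amounts to converting each trial into a point of (cumulative minutes of practice, score per minute). A minimal sketch, assuming each trial's raw score is divided by its length in minutes and practice time accumulates across trials. The function name and the sample scores are hypothetical, not data from the study.

```python
def cumulative_practice(scores, minutes_per_trial, start_minutes=0.0):
    """Convert per-trial raw scores to (cumulative minutes, score per minute).

    start_minutes carries over practice from an earlier series of trials.
    """
    points = []
    elapsed = start_minutes
    for score in scores:
        elapsed += minutes_per_trial
        points.append((elapsed, score / minutes_per_trial))
    return points


if __name__ == "__main__":
    # Hypothetical raw scores: two 2-minute trials, then one 4-minute trial
    lab_1 = cumulative_practice([100, 120], minutes_per_trial=2)
    lab_2 = cumulative_practice([260], minutes_per_trial=4,
                                start_minutes=lab_1[-1][0])
    print(lab_1 + lab_2)
```

With this convention, 15 two-minute trials span the first 30 minutes of practice and 15 four-minute trials the next 60, so laboratory and field series share a common horizontal axis.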

The fit between the two studies for the first 30 minutes of practice
is surprisingly good considering the known differences in the two
experiments: (a) design - all performance massed in 4 days versus distributed
over two 3-week periods 11 weeks apart; (b) test length - 2 and 4 minute
trials were combined in the laboratory study versus two minute trials
only in the field study; and (c) subjects - Navy versus Coastguardsmen.
Secondly, the fit is also good during the sea trials with the exception
of the second hour at sea, where the poorest performance of all was
obtained. This finding of performance degradation is concordant with the
high motion sickness symptoms during this time frame (Wiker, Kennedy,




McCauley & Pepper, 1979). Moreover, because of the stability and
differentiation of the laboratory version of the task and the close agreement
between the two studies after the at sea decrement, the authors are
inclined to consider this a real effect of motion on performance.

Discussion 

The PETER Program is underway whereby psychological tests are being
examined critically to determine their suitability for use in detecting
performance degradation in novel environments (Kennedy & Bittner, 1977).
The criteria against which tests are compared focus on stability and
sensitivity. Stability is measured by examining the effects of extended
practice on means, standard deviations and cross session reliabilities.
Means are stable if they are level, asymptotic or exhibit constant slope.
Standard deviations may be level or increase slightly with the mean.
Cross session reliabilities are considered stable after they cease to
change over sessions. In this study, the means, standard deviations and
reliabilities all qualify for the PETER battery. Means have constant
slope after Day 8 of Experiment 1 and standard deviations are also level
after that time. The reliabilities are moderate (r > .75) and stable
after Day 8. Experiment 2 showed several things: (a) stability is still
present 3 months later; (b) a test twice as long only improves
reliability to an average of r=.80 while effectively doubling mean
performance. This Code Substitution Test appears to be an excellent
candidate for inclusion in PETER from the laboratory results.

The results of the sea trials in this study provide at the same time
vindication and validation of the PETER paradigm. The laboratory task
sufficiently differentiates subjects, and is stable, so that slight
departures may be ascribed to environmental and not artifactual
variables. The benefit of being able to compare real world performances
at sea with those of a control group in a laboratory is also noteworthy.
Both laboratory and environmental results recommend the use of the Code
Substitution Test in PETER or other environmental batteries.

References 


Bittner,  Jr.,  A.  C.  Tests  of  differential  stability.  Proceedings  of  the 
23rd  Annual  Meeting  of  the  Human  Factors  Society,  Boston,  October, 
1979. 

Derner, C. F., Aborn, M. and Cantor, A. M. The reliability of the
Wechsler-Bellevue Subtest and Scales. Journal of Consulting
Psychology, 1950, 14, 172-179.

Harbeson, M. M., Kennedy, R. S. & Bittner, A. C., Jr. A comparison of the
Stroop Test to other tasks for studies of environmental stress.
Proceedings of the 12th Annual Meeting of the Human Factors
Association of Canada, Bracebridge, Ontario, Canada, September,
1979.

Jones,  M.  B.  Stabilization  and  task  definition  in  a  performance  test 
battery.  (NAMRL  Monograph  No.  27).  Pensacola,  FL:  U.S.  Naval 
Aerospace  Medical  Research  Laboratory,  1980. 

Kennedy, R. S. & Bittner, A. C., Jr. The development of a Navy
Performance Evaluation Test for Environmental Research (PETER). In
Productivity Enhancement: Personnel Performance Assessment in
Navy Systems. Symposium presented at the Naval Personnel Research
and Development Center, San Diego, CA, 12-14 October, 1977. (NTIS
No. AD-056047).

Kennedy, R. S., Bittner, A. C., Jr., & Harbeson, M. M. An engineering
approach to the standardization of Performance Evaluation Tests for
Environmental Research (PETER). Proceedings of the 11th Annual
Conference of the Environmental Design Research Association,
Charleston, S.C., March, 1980.

Pristo,  L.  J.  Comparing  WAIS  and  WISC-R  Scores.  Psychological  Reports, 
1978,  42,  515-518. 

Thomas, D. J., Majewski, P. L., Ewing, C. L. & Gilbert, N. S. Medical
Qualification Procedures for Hazardous-duty Aeromedical Research.
(Conference Proceedings No. 231, A3, pp. 1-13, 1978) London:
AGARD, 1977.

Wechsler,  D.  Measurement  of  Adult  Intelligence  (1st  ed.).  Baltimore: 
Williams  &  Wilkins  Co.,  1939. 

Wechsler,  D.  The  Measurement  and  Appraisal  of  Adult  Intelligence. 
Baltimore:  The  Williams  &  Wilkins  Co.,  1958. 

Wiker, S. F. & Pepper, R. L. Change in crew performance, physiology and
affective state due to motions aboard a small monohull vessel: a
preliminary study. Coast Guard Technical Report No. CG-D-75-78,
1978.

Wiker, S. F., Kennedy, R. S., McCauley, M. E., & Pepper, R. L.
Susceptibility to seasickness: Influence of hull design and steaming
direction. Aviation, Space & Environmental Medicine, 1979, 50, 1046-1051.

Woodrow, H. Factors in improvement with practice. The Journal of
Psychology, 1937, 7, 55-70.

Table  Figures 


Table 1

Experiment 1: Comparison of
Reliabilities for Total Correct
Over 15 Days (n=18)

[Correlation matrix not legible in this copy.]

[Sample form: a digit key line followed by nine rows of 15 letters, each
letter with a blank response space beneath it.]

Figure 1. Coding Test Sample Form
(Day 1, Experiment 1)


Figure 2. Experiment 1: Means
and Standard Deviations for
Total Correct Over 15 Days
(n=19).


Figure 3. Experiment 1: Correlations
Between Selected Base
Days (1, 2, 4, 8, 10, 12) and
Those Following For Total Correct
Over 15 Days (n=19).




Figure  4.  Experiment  1:  Means  and 
Standard  Deviations  for  Total 
Correct  Over  15  Days  (n=12). 


Figure  5.  Experiment  2:  Means 
and  Standard  Deviations  for 
Total  Correct  Over  15  Days  (n=12). 



Figure 6. Experiment 2: Correlations
Between Selected Base Days
(1, 2, 4, 8, 10, 12) and Those
Following for Total Correct Over
15 Days (n=12).


[Plot axis: minutes of practice; Experiment 3 trials divided into
dockside and at-sea segments.]

Figure 7. Experiments 1, 2, and
3: Mean Number Attempted in 1
Minute.


Proceedings  of  the  12th  Annual  Meeting  of  the  Human  Factors  Association  of  Canada 
Bracebridge,  Ontario,  Canada,  September,  1979 


A  COMPARISON  OF  THE  STROOP  TEST  TO  OTHER  TASKS  FOR  STUDIES  OF  ENVIRONMENTAL  STRESS 
Mary  M.  Harbeson,  Robert  S.  Kennedy  and  Alvah  C.  Bittner,  Jr. 

Naval Aerospace Medical Research Laboratory Detachment
P. O. Box 29407, New Orleans, Louisiana 70189

ABSTRACT 

A program is underway to standardize a battery of Performance Evaluation Tests for Environmental
Research (PETER). The purpose of the program is to develop a test battery which will measure the effects
of extended exposure to unusual environments (e.g., ship motion and vibration) on the performance of U.S.
Navy personnel. Tasks which meet one or more of the following criteria are being examined: sensitivity
to unusual environments, diagnostic capability for brain damage, or the ability to measure some aspect of
information processing. The strategy for developing PETER has been to administer each task for 15
consecutive work days to the same group of 20 men who serve as volunteer subjects, and to examine the
stability of the means, variances and reliabilities. These statistics thus become specifications which may be
employed to evaluate and compare the suitability of tasks for inclusion in a test battery. This report
focuses chiefly on the Stroop Test and describes our approach in detail. The Stroop specifications are
compared with "good" and "bad" tasks from our recent experiments. The tests used for comparison are:
complex counting, critical tracking, time estimation, arithmetic and air combat maneuvering. Examples of
tests which are unsuitable because of failure to meet only one of the three criteria are shown. The
importance of the stability of the reliability, heretofore ignored in performance test battery
construction, is discussed.


INTRODUCTION 

PETER  Paradigm 

An experimental program for the development of
Performance Evaluation Tests for Environmental Research
(PETER) is currently underway at the Naval Aerospace
Medical Research Laboratory (NAMRL) (Kennedy & Bittner,
1977, 1978). The purpose of the program is to develop
a test battery to determine if human performance is
disrupted by the unusual environmental conditions
experienced by Navy personnel (e.g., ship motion and
vibration) over extended exposures. The program is
designed to resemble an engineering test and evaluation
program, since each test or element is subjected to an
analysis of its performance specifications.

Specifically, baseline measures of performance are
obtained in a series of tests administered for 15
consecutive weekdays to the same group of subjects.
Three statistical criteria are being considered in the
evaluation of the suitability of a test for use in
unusual environments, viz. means, variances, and cross
session reliabilities. Whereas stable means are
intuitively desirable for the study of environmental
effects, we feel that other approaches based upon
reliability are more relevant. For example, when subjects
serve as their own control, task reliability can
sharpen the Student's t Test by reducing the standard
error which appears as the denominator in equation (1).
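
The form of equation (1) consistent with this description is the standard t statistic for correlated (repeated) measures. In the sketch below the notation is assumed rather than the authors' own: the X-bars are the two condition means, s_1^2 and s_2^2 the between-subject variances, r_12 the cross session correlation, and N the number of subjects.

```latex
t = \frac{\bar{X}_1 - \bar{X}_2}
         {\sqrt{\dfrac{s_1^2 + s_2^2 - 2\, r_{12}\, s_1 s_2}{N}}}
\tag{1}
```

With equal variances (s_1 = s_2 = s) the denominator reduces to s\sqrt{2(1 - r_{12})/N}, so it shrinks toward zero as r_12 approaches unity.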


In particular, it may be seen that as the correlation
approaches unity and the variances remain equivalent,
the denominator of (1) will approach zero. Other
statistical considerations also direct attention to
reliability as a criterion. Simple repeated measures
analysis of data, in particular, requires stability of
reliabilities across trials (Jones, 1979; Bittner,
1979). Hence, the strategy for building PETER, in
addition to monitoring changes in the means and
variances, has been to focus on the stability of the
reliabilities of a test over many sessions. Each of these
statistical criteria warrants separate discussion.

Means. It is felt that there are three criteria
for mean stability: (1) Plateaus are most desirable
but they occur infrequently (cf. e.g., Kennedy &
Bittner, 1978); (2) Asymptotic means are acceptable
but are not always obtained even when practice is
extended (Bran Ley, 1969); and (3) Jones (1979) has
suggested that a slow, regular, linear increase over
sessions also reflects stability.

Standard deviations. Whereas the within-subject
variance can be expected to decrease with practice, it
is the between-subject variances which are listed in
equation (1). These between-subject variances may be
considered stable when they are constant. In
addition, as the means increase, it is possible that
standard deviations will also increase with practice
(Jones, 1972, p. 109). Standard deviation stability is
considered to be present in this latter case if there
is a concordant stabilization in the means and
correlations.

Correlations. Since at least the time of Perl
(1934), it has been known that during the acquisition
phases of practice, the cross session reliabilities
can be expected to change. This change takes the
general form of a decrease along any row in the
correlation matrix beginning with the superdiagonal, and
has been referred to as Simplex form (Humphreys, 1960;
Jones, 19b!0). It has been inferred that when these
correlations cease to change within the matrix, then
the task has differentially stabilized (Jones, 1979).
We concur with this criterion and employ graphical
analysis to determine where and if stabilization is
obtained. The level at which a task differentiates
subjects after it has stabilized is also an important

* This research was performed under Navy Contract No. MF58.524.002-5027. The opinions are those of the authors,
and do not necessarily reflect those of the Department of the Navy.



factor involving the cross session reliabilities. High
correlations obviously are most desirable and r=.707 is
considered to be a lower limit for inclusion in PETER.
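
The stabilization criterion can also be approximated numerically. The authors rely on graphical analysis, so the following is only a stand-in: it looks for the first base session whose correlations with all later sessions vary by no more than a chosen tolerance. The function name and the tolerance value are assumptions.

```python
def first_stable_session(corr, tol=0.05):
    """Return the first session (1-indexed) whose row is flat past the diagonal.

    corr is a square cross session correlation matrix (list of lists).
    A row counts as flat when the spread (max - min) of its entries to
    the right of the diagonal is at most tol. Returns None if no row
    with at least two later sessions qualifies.
    """
    n = len(corr)
    for base in range(n - 2):  # need at least two later sessions to compare
        later = corr[base][base + 1:]
        if max(later) - min(later) <= tol:
            return base + 1
    return None


if __name__ == "__main__":
    # Hypothetical 5-session matrix: rows 1-2 decline, row 3 is level
    corr = [
        [1.00, 0.80, 0.60, 0.50, 0.40],
        [0.80, 1.00, 0.70, 0.60, 0.50],
        [0.60, 0.70, 1.00, 0.75, 0.74],
        [0.50, 0.60, 0.75, 1.00, 0.76],
        [0.40, 0.50, 0.74, 0.76, 1.00],
    ]
    print(first_stable_session(corr))
```

Early rows that decline away from the superdiagonal show the Simplex pattern; the first flat row marks the approximate point of differential stabilization, after which the level of the correlations can be compared with the .707 limit.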

Task selection criteria. Candidate subtasks must
meet one or more of the following criteria in order to
be evaluated for inclusion in PETER: sensitivity to
unusual environments; neuropsychological diagnostic
capability; the ability to measure some aspect of
information processing; and practical (e.g., cost)
considerations (Kennedy & Bittner, 1977). The Stroop
Test (Stroop, 1935) has been reviewed extensively
(Jensen & Rohwer, 1966; Dyer, 1973). It was chosen for
study since it met all our criteria for test selection.

The Stroop Test

Background. The Stroop test has been applied in a
wide variety of investigations and will be used as an
example of the analysis applied in the PETER program.
It has been used as a measure of psychological stress
in environmental studies (Reilly & Cameron, 1968;
iiersner & Cameron, 1970; Schilling, Werts &
Schandelmeier, 1976; Allan, Gibson & Green, 1979).
Further, it has been shown to be sensitive to age,
drugs, psychiatric disturbance and organic brain damage
(Jensen & Rohwer, 1966; Comalli, Wapner & Werner, 1962;
Dyer, 1973). It has frequently been used in the study
of information processing functions (Stroop, 1935,
1938; Jensen & Rohwer, 1966; Dyer, 1973; Rost., 197 A;
Williams, 1977). In addition, the Stroop test has many
attractive practical features. It can be group
administered, takes very little time, and the apparatus is
simple, economical and portable.

The Stroop test is reported to provide measures of
individual differences on three factors: a speed
factor, color-naming facility, and (of greatest interest
to investigators) interference proneness (Jensen &
Rohwer, 1966). The interference score, or "Stroop
phenomenon", is the increase in reaction time between
naming a color and naming the color of words printed in
incompatible colors. This score is described as an
index of susceptibility to mild stress (Thurstone &
Mellinger, 1953, cited in Jensen, 1966; Sarmany, 1977)
or the ability to resist distraction (Comalli, Wapner &
Werner, 1962) although the generality of this finding
has yet to be demonstrated. The psychological
characteristics of the Stroop appear to be primarily in the
cognitive realm (Dyer, 1973; Golden, Marsella & Golden,
1975; Jensen & Rohwer, 1966; Sarmany, 1977). Stated
differently, individual differences in the "Stroop
phenomenon" are most likely related to differences in
perceptual style.
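
The interference score described above is a simple difference, sketched below per subject. The data layout (dictionaries of times keyed by subject) and the function name are illustrative assumptions. As noted elsewhere in the text, many other scoring methods exist, and this is only the plain difference form.

```python
def interference_scores(color_naming_times, color_word_times):
    """Interference per subject: time on incompatible color words minus
    time on plain color naming (same units for both)."""
    return {
        subject: color_word_times[subject] - color_naming_times[subject]
        for subject in color_naming_times
    }


if __name__ == "__main__":
    # Hypothetical times in seconds for two subjects
    naming = {"S1": 40, "S2": 45}
    words = {"S1": 62, "S2": 70}
    print(interference_scores(naming, words))
```

A larger positive difference indicates greater interference proneness under this definition.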

In summary, there is sufficient research to
suggest that performances on the Stroop Test tap an
important faculty of an individual. Moreover, it can be
inferred that this faculty is related to the work that
Naval personnel perform during the course of their
motion exposures, at sea and in flight. Regardless of
whether the faculty is called interference proneness or
stress susceptibility, it remains to be determined
whether this faculty is an enduring aspect of an
individual.

Alternate forms. Many versions of the Stroop Test
are available but most use the following three conditions:
(a) black and white words (BW) - color names
written in black and white; (b) color blocks (CB) -
blocks of color (usually red, blue and green) contained
in a single, specified shape; and (c) a color-word
condition (CW) - color names written in incompatible
colors (e.g., the word "red" printed in blue). In
previous research, methods of administration and
scoring have varied but the interference effect has
still been obtained. Subjects have been required to
make verbal responses or manual responses (Flowers &
Stoup, 1977; Jensen & Rohwer, 1966), such as key
pressing (Keele, 1972) or card sorting (Stroop, 1938).
Individual and group administration have been employed
(Golden, 1975; Jensen & Rohwer, 1966). Mode of presentation
and arrangement of stimuli, as well as number
of colors (Golden, 1974; Jensen & Rohwer, 1966; Williams,
1977)  have  also  been  varied.  Numerous  (^20)  scoring 
methods have been developed by the many investigators
who  have  employed  the  Stroop  Test  (Jensen,  1965). 


Adaption for PETER. In order to adapt the Stroop
Test  for  environmental  testing,  group  administration 
with  manual  responses  was  selected,  and  slides  were 
used  for  presentation.  The  arrangement  of  stimulus 
material,  conditions,  colors,  and  method  of  scoring 
were  those  used  most  commonly  in  other  studies  (Jensen 
&  Rohwer,  1966;  Jensen,  1965;  Dyer,  1973). 

Other  Tests 

The  Stroop  Test  results  were  compared  with  those 
obtained  on  five  other  tasks  which  have  also  been 
studied  for  inclusion  in  PETER:  complex  counting, 
critical  tracking,  time  estimation,  arithmetic  compu¬ 
tation,  and  air  combat  maneuvering.  All  tests  were 
administered  and  analyzed  according  to  the  PETER 
paradigm. In the complex counting test (Kennedy &
Bittner, 1979), subjects listened to three tones
played simultaneously, and were required to keep track
of every fourth low and medium tone. For the critical
tracking test (Damos, Kennedy & Bittner, 1979), the
apparatus was a replication of that used by Jex,
McDonnell & Phatak (1966). In the time estimation test
(McCauley, Kennedy & Bittner, 1979), subjects produced
time intervals by verbal request. The arithmetic test
(Seales, Kennedy & Bittner, 1979) consisted of a
paper and pencil presentation of simple arithmetic
operations. The air combat maneuvering test (Jones,
Kennedy & Bittner, 1979) was an adaptation of an
Atari Video Game (Atari, 1977) in which the subjects
attempted to hit a moving drone with a missile.

Purpose 

The  purpose  of  this  study  was  to  determine  the 
suitability  of  a  group  administered  form  of  the  Stroop 
Test  by  examining  the  effects  of  many  sessions  on  the 
reliability,  variability,  and  mean  performance  of 
three  basic  scores  (BW,  CB  and  CW)  and  two  derived 
scores  (BW-CB  and  CB-CW) .  The  Stroop  "specifications" 
were  then  compared  to  those  of  five  other  tests  pre¬ 
viously  studied  in  the  PETER  program:  complex 
counting,  critical  tracking,  time  estimation,  arith¬ 
metic  computation,  and  air  combat  maneuvering. 
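
The two derived scores named above are simple per-subject differences of the basic scores. As a minimal sketch (the function name and the sample response counts are hypothetical, not taken from the study), they could be computed as:

```python
def derived_scores(bw, cb, cw):
    """Return per-subject (BW-CB, CB-CW) difference scores for one day.

    bw, cb, cw: response counts for the same subjects, in the same order,
    on the black-and-white word, color block and color-word conditions.
    """
    bw_cb = [b - c for b, c in zip(bw, cb)]
    cb_cw = [c - w for c, w in zip(cb, cw)]
    return bw_cb, cb_cw


# Hypothetical response counts for three subjects on a single day.
bw = [52, 47, 60]
cb = [50, 44, 55]
cw = [38, 35, 41]
bw_cb, cb_cw = derived_scores(bw, cb, cw)
```

Because each derived score is a difference of two measured scores, it is evaluated against the same three criteria (means, variability, reliability) as the basic scores.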

METHOD 

Subjects 

The  subjects  were  a  group  of  19  Navy  enlisted 
men,  ages  19  to  24,  who  had  served  as  volunteer  sub¬ 
jects  in  several  biodynamics  studies  since  induction 
into  the  Navy  (approximately  18  months  prior  to  the 
testing). To qualify for this medical research program,
they had to be equal to or above the norms for Navy
enlisted  personnel  in  physical  health,  mental  health 
and  intelligence.  All  volunteer  subjects  were  re¬ 
cruited,  evaluated  and  employed  in  accordance  with 
procedures  specified  in  Secretary  of  the  Navy 
Instruction  3900.39  and  Bureau  of  Medicine  and  Surgery 
Instruction  3900.6  which  are  based  upon  voluntary 
informed consent, and meet the provisions of prevailing
national and international guidelines. A
description of the subject selection procedure is
given by Thomas, Majewski, Ewing and Gilbert (1977).

Apparatus

Slides  (35  mm)  were  used  to  present  the  stimulus 
material  for  the  three  conditions,  BW,  CB  and  CW.  The 
items  on  each  slide  were  arranged  in  a  10  X  10  matrix 
of  evenly  spaced  rows  and  columns.  The  colors  red, 
blue  and  green  were  used.  Rectangles  of  color  were 
used  for  the  CB  slide.  Items  on  all  cards  were  in 
random  order.  There  were  two  alternate  forms  for  each 
condition.  The  slides  were  presented  by  means  of  a 
Kodak Carousel Projector (750H), and projected on a
1.45 m x 1.32 m movie screen which was placed approximately
3 meters from the subjects, who were seated in
armchair desks. Subjects responded by pushing buttons
labeled, left to right, "R" for red, "B" for blue, and
"G" for green, which were located on small switch boxes
that were placed on each desk top. Subjects' responses
were automatically recorded on instrument chart paper.

A  Kronos  stopwatch  was  used  to  regulate  both  the 
slide-viewing  time  and  the  inter-trial  interval.  The 
arrangement  of  the  apparatus  provided  for  testing  in 
groups  of  four. 
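
To illustrate the stimulus layout described above, the following Python sketch builds one CW slide as a 10 x 10 matrix of color words, each paired with an incompatible ink color. The function name, the (word, ink) tuple representation and the uniform random draw are assumptions made for illustration; the source states only that the items were in random order.

```python
import random

COLORS = ("red", "blue", "green")


def make_cw_slide(rows=10, cols=10, seed=None):
    """Build a rows x cols matrix of (word, ink) pairs with word != ink."""
    rng = random.Random(seed)  # seeded generator so a slide can be reproduced
    slide = []
    for _ in range(rows):
        row = []
        for _ in range(cols):
            word = rng.choice(COLORS)
            # The ink color is drawn only from the colors the word does not name.
            ink = rng.choice([c for c in COLORS if c != word])
            row.append((word, ink))
        slide.append(row)
    return slide


slide = make_cw_slide(seed=1)
```

A BW or CB slide could be sketched the same way by keeping only the word or only the ink of each item.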

Procedure 


The two alternate forms for each condition were
arranged in eight possible combinations. A different
order of presentation was used each day for eight days
and seven of the combinations were repeated, one for
each day, for the last seven days of testing. In the
initial experimental session, after extensive practice
on the use of the response keys, the subjects were
instructed to begin responding to each slide immediately
after it appeared on the screen. Instructions
to the subjects for each of the 3 slides in the order
in which they appeared were: (a) BW - to push the
buttons corresponding to the color names as they
appeared; (b) CB - to push the buttons corresponding to
the color blocks as they appeared; (c) CW - to push the
button corresponding to the color that each word was
written in, regardless of the color that the word described.
Each of the slides remained on the screen for
30 seconds and the inter-trial interval was 5 seconds.
The same procedure, with the exception of abbreviation
of instructions, was followed on subsequent testing
days. The response measure was the number of responses
in 30 seconds for each condition.
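
The eight combinations follow from two alternate forms for each of three conditions (2 x 2 x 2 = 8). The sketch below enumerates them and builds a 15-day schedule. The form labels "A" and "B" and the choice to repeat the first seven combinations are assumptions for illustration only; the source does not say which seven were repeated or in what order.

```python
from itertools import product

CONDITIONS = ("BW", "CB", "CW")
FORMS = ("A", "B")  # hypothetical labels for the two alternate forms


def form_combinations():
    """Return every assignment of one form to each of the three conditions."""
    return [dict(zip(CONDITIONS, combo))
            for combo in product(FORMS, repeat=len(CONDITIONS))]


def fifteen_day_schedule():
    """Eight distinct combinations, then seven repeats (assumed: the first seven)."""
    combos = form_combinations()
    return combos + combos[:7]


schedule = fifteen_day_schedule()
```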

RESULTS 


Figure  1  shows  mean  performance  for  the  three 
directly  measured  scores  (BW,  CB  &  CW)  and  the  two 
derived  scores  (BW-CB  and  CB-CW).  The  overall  im¬ 
pression  for  the  directly  measured  scores  is  of 
learning  curves  which  are  near  asymptote  after  Day  10. 
The  two  derived  scores,  CB-CW  and  BW-CB,  appear  to 
approach  an  asymptote  subsequent  to  Day  6.  Mean  re¬ 
sponses  for  BW  and  CB  were  greater  than  CW  throughout 
the  test.  Standard  deviations  for  the  three  direct  and 
two  derived  scores  are  given  in  Figure  2.  It  may  be 
seen  that  the  direct  scores  appear  relatively  stable 
and  appear  to  covary  with  the  means  in  Figure  1.  In 
other words, there is slightly more variability as the
mean responses increase, following the general rule
described by Jones (1972). Standard deviations for the
two  derived  scores  appear  nearly  level.  A  two-way 
analysis  of  variance,  repeated  measures  design,  showed 
significant  days  (practice)  and  subjects  effects  for 
all  scores  (jD\10  )• 

Tables 1 through 5 contain the correlations (reliabilities)
over 15 days for the direct and derived
Stroop scores. Figures 3 through 7 were drawn from
these tables and show reliability "traces" for selected
Base Days (1, 2, 4, 9 and 13) for the five scores.
Trace plots were made of the correlations of each base
day with those following (i.e., Base Day 1 with 2, 1
with 3, 1 with 4, ..., 1 with 15; Base Day 2 with 3, 2
with 4, 2 with 5, ..., 2 with 15; etc.). A fuller description
of the construction and interpretation of
this type of plot is given elsewhere (Bittner, 1979).
Examining  these  figures,  it  may  be  seen  that  BW 
(Figure  3),  CB  (Figure  4)  and  CW  (Figure  5),  reli¬ 
abilities  are  relatively  high  after  the  early  base 
days. For example, on BW the correlations of Base Day
1 with the days after base performance range between
r = .5 and r = .7, while the correlation of Base Day 9
with subsequent days is of the order of r = .9. CB
(Figure 4) proved to be most reliable, with virtually
all correlations of base days with subsequent days
ranging from about r = .75 to r = .96. BW was more
reliable than CW for early days after base performance,
but there is a more pronounced decline in
reliability for CW. The derived scores, BW-CB (Figure
6) and CB-CW (Figure 7), proved to be relatively unreliable,
mutually ranging from a high of r = .59 to
zero.
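
A reliability trace of the kind described above can be sketched as follows: for a chosen base day, the scores of all subjects on that day are correlated with their scores on each later day. The function names and the small data set are hypothetical, and the sketch assumes the ordinary Pearson product-moment correlation.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def reliability_trace(scores, base_day):
    """Correlate a base day with every later day.

    scores: one list per subject, holding that subject's score on each day.
    base_day: 1-indexed day number used as the base.
    Returns a list of (later_day, r) pairs.
    """
    n_days = len(scores[0])
    base = [subject[base_day - 1] for subject in scores]
    trace = []
    for day in range(base_day + 1, n_days + 1):
        later = [subject[day - 1] for subject in scores]
        trace.append((day, pearson_r(base, later)))
    return trace


# Hypothetical scores: four subjects observed on four days.
scores = [
    [10, 12, 14, 15],
    [20, 22, 21, 25],
    [30, 32, 33, 35],
    [40, 42, 44, 45],
]
trace = reliability_trace(scores, base_day=1)
```

Plotting the returned pairs for several base days on one set of axes gives a trace plot in the sense used here.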


The  results  of  the  five  tasks  which  were  compared 
to  the  Stroop  Test  are  summarized  in  Table  6.  The 
means,  standard  deviations  and  correlations  for 
selected days are shown in Figures 8 through 17.

DISCUSSION 


From  graphical  analyses  of  the  basic  scores  (BW, 
CB,  and  CW) ,  it  is  apparent  that  these  means  and 
standard  deviations  are  virtually  stable  after  the 
initial  base  day's  practice.  In  general  it  would  also 
appear  that  a  relatively  stable  and  satisfactory  level 
of  reliability  is  available  for  all  three  of  these 
measures  subsequent  to  the  early  base  days'  practice. 
The means and standard deviations of the derived
scores (BW-CB and CB-CW) (Figures 6 & 7) also show
invariant behaviors over 15 sessions, but the reliabilities
were extremely low.

It  is  possible  that  the  reliability  of  the 
derived  scores  could  be  increased  by  making  some 
changes  in  the  administration  of  the  test.  A  longer 
session,  each  performance  day,  could  be  expected  to 
raise  reliabilities,  perhaps  with  greater  spacing 
between  the  BW,  CB  and  CW  tasks.  It  is  also  possible 
that  the  amount  of  interference  could  be  increased  by 
changing the test in other ways. It has been found
in previous studies that when motor rather than verbal
responses are required, the color naming response is
greater than the reading response (Flowers & Stoup,
1977; Keele, 1972; Stroop, 1938). In the present
study, the response keys were marked with letters,
thus combining reading and manual responses initially,
although the letter-color relationships
were considerably over-learned. Perhaps a purer
measure of interference, and greater reliabilities
in the derived scores, could be obtained by
changing the response requirement to verbal rather
than manual. This modification would, however, limit the
usefulness of the test for environmental test
purposes, since group administration is
of considerable practical importance (Kennedy &
Bittner, 1977).

Regardless  of  whether  or  not  the  reliability  of 
the  derived  scores  could  be  improved  by  changing  the 
testing procedure, the important point is that the
problem  could  not  have  been  identified  without 
examining  all  three  statistical  criteria.  To  further 
illustrate  the  importance  of  this  type  of  analysis, 
and to demonstrate the possible combinations of means,
standard deviations and correlations, the Stroop
results  were  compared  to  five  other  tasks  which  have 
been  studied  in  the  PETER  program. 

The analyses of the five other tests (Figures
8-17) follow the same paradigm as shown for the five
Stroop  scores.  These  five  tests  were  selected  from 
over  50  experiments  since  they  contained  examples  of 
our  major  findings  to  date  concerning  task  stabili¬ 
zation.  Stabilities  of  means,  standard  deviations  and 
cross  session  reliabilities  were  judged  according  to 
the  criteria  listed  previously.  These  judgments  are 
summarized in Table 6. It is our opinion that the most
important finding in this table is that means alone
(even means + standard deviations) are inadequate for
determining  stability.  This  finding  achieves  greater 
importance  when  viewed  in  connection  with  the  scien¬ 
tific  literature  which  reports  performances  in  exotic 
environments.  It  is  quite  possible  that  no  experiment 
has  ever  been  performed  in  an  unusual  environment 
whereby  adequate  task  stabilization  was  obtained  in  the 
pretest  condition. 

In  summary,  a  group  form  of  the  Stroop  Test  was 
administered  according  to  the  PETER  paradigm.  Means, 
standard  deviations  and  correlations  were  examined  and 
compared with those from five other tasks. It was
concluded that the three basic scores of the Stroop
Test (BW, CB and CW) appear to be acceptable for inclusion
in PETER. However, the lack of derived score
reliabilities suggests that neither of these scores in
its present form characterizes a sufficiently stable
faculty of mental work to be useful in the study of
unusual environments.

REFERENCES 

Allan, J. R., Gibson, T. M. & Green, K. G. Effect of
induced cyclic changes of deep body temperature on
task performances. Aviation, Space, and
Environmental Medicine, 1979, 50, 585-589.

Atari, Inc. Combat game program instructions.
Sunnyvale, California: Atari, Inc., Consumer
Division, 1977. (CO11-402-01).

Biersner, R. J. & Cameron, B. J. Cognitive performance
during a 1000-foot helium dive. Aerospace
Medicine, 1970, 41, 918-920.

Bittner Jr., A. C. Statistical tests for differential
stability. Proceedings of the 23rd Annual
Meeting of the Human Factors Society, Boston,
October, 1979 (in press).

Bradley, J. V. Practice to an asymptote. Journal
of Motor Behavior, 1969. 28>-.*u-.

Comalli, P. E., Wapner, S. & Werner, H. Interference
effects of Stroop Color-Word Test in childhood,
adulthood and aging. Journal of Genetic
Psychology, 1962, 100, 47-53.

Damos, D. L., Kennedy, R. S. & Bittner, Jr., A. C.
Development of Performance Evaluation Tests for
Environmental Research (PETER): Critical tracking
test. Proceedings of the 50th Annual Meeting
of the Aerospace Medical Association, Washington,
D.C., May, 1979. (AD A066719)

Dyer, F. N. The Stroop phenomenon and its use in the
study of perceptual, cognitive, and response
processes. Memory and Cognition, 1973, 1,
108-120.


Flowers, J. H. & Stoup, C. M. Selective attention
between words, shapes and colors in speeded
classification and vocalization tasks.
Memory and Cognition, 1977, 299-307.

Golden,  C.  J.  Effect  of  differing  number  of  colors  on 
the  Stroop  color  and  word  test.  Perceptual  and 
Motor  Skills,  1974,  39,  50. 

Golden, C. J. A group form of the Stroop color and
word test. Journal of Personality Assessment,
1975, 39, 386-388.

Golden, C. J., Marsella, A. J. & Golden, E. E.
Personality correlates of the Stroop color and
word test: more negative results. Perceptual
and Motor Skills, 1975, 41, 599-602.

Humphries, L. G. Investigation of the simplex.
Psychometrika, 1960, 25, 313-323.

Jensen, A. R. Scoring the Stroop Test. Acta
Psychologica, 1965, 24, 398-408.

Jensen, A. R. & Rohwer, W. D. The Stroop Color-Word
Test: A review. Acta Psychologica, 1966, 25,
36-93.

Jex, H. R., McDonnell, J. D. & Phatak, A. V. A
"critical" tracking task for manual control
research. IEEE Transactions on Human Factors in
Electronics, 1966, HFE-7, 138-145.

Jones, M. B. Individual differences. In R. N. Singer
(Ed.), The Psychomotor Domain. Philadelphia:
Lea and Febiger, 1972.

Jones, M. B. Differential processes in acquisition.
In E. A. Bilodeau and I. McD. Bilodeau (Eds.),
Principles of skill acquisition. New York:
Academic Press, 1969.

Jones,  M.  B.  Stabilization  and  task  definition  in  a 
performance  test  battery.  Pennsylvania  State 
University  College  of  Medicine,  Final  Report  on 
Contract  N0023-79-M-5089 ,  May  1979. 

Jones, M. B., Kennedy, R. S. & Bittner, Jr., A. C. A
video game for performance testing. Paper presented
at the Rocky Mountain Psychological
Association Annual Meeting, Las Vegas, NV, May
1979.

Kennedy,  R.  S.  &  Bittner  Jr.,  A.  C.  The  development 
of  a  Navy  Performance  Evaluation  Test  for 
Environmental  Research  (PETER).  In 
Productivity  Enhancement:  Personnel 
Performance  Assessment  in  Navy  Systems,  Naval 
Personnel  Research  and  Development  Center,  San 
Diego,  CA  12-14,  October  1977.  (AD  A056047) 

Kennedy, R. S. & Bittner Jr., A. C. Progress in the
analysis of a Performance Evaluation Test for
Environmental Research (PETER). Proceedings
of the 22nd Annual Meeting of the Human Factors
Society, Detroit, MI, October 1978.
(AD A060676)

Kennedy, R. S. & Bittner, Jr., A. C. Development of
Performance Evaluation Tests for Environmental
Research (PETER): Complex counting test.
Journal of Aviation Space and Environmental
Medicine (in press).

Keele, S. W. Attention demands of memory retrieval.
Journal of Experimental Psychology, 1972, 93,
245-248.

McCauley, M. E., Kennedy, R. S. & Bittner, Jr., A. C.
Development of Performance Evaluation Tests for
Environmental Research (PETER): Time estimation
test. Proceedings of the 23rd Annual Meeting of
the Human Factors Society, Boston, October, 1979
(in press).

Perl, R. E. An application of Thurstone's method of
factor analysis to practice series. Journal of
General Psychology, 1934, 11, 209-212.

Reilly,  R.  E.  &  Cameron,  B.  J.  An  integrated 
measurement  system  of  the  study  of  human 
performance  in  the  underwater  environment.  Falls 
Church,  VA:  Bio-Technology,  Inc.  1968. 

Rose,  A.  M.  Human  Information  Processing:  An 
Assessment  and  Research  Battery.  Doctoral 
Dissertation,  Ann  Arbor,  MI:  University  of 
Michigan,  1974,  (also  published  as 
AFOSR-PR-74-1372) .  AD-785-411. 

Sarmany, I. Different performance in Stroop's
interference test from the aspect of personality
and sex. Studia Psychologica, 1977, 19, 60-67.

Schilling, C. W., Werts, M. R. & Schandelmeier, N. R.
(Eds.) The Underwater Handbook: A Guide to
Physiology and Performance for the Engineer. New
York: Plenum Press, 1976.

Seales, D. M., Kennedy, R. S. & Bittner, Jr., A. C.
Development of Performance Evaluation Tests for
Environmental Research (PETER): Arithmetic computation.
Proceedings of the 23rd Annual Meeting
of the Human Factors Society, Boston, October,
1979 (in press).

Stroop, J. R. Studies of interference in serial verbal
reactions. Journal of Experimental Psychology,
1935, 18, 643-662.

Stroop, J. R. Factors affecting speed in serial verbal
reactions. Psychological Monographs, 1938, 50,
38-48.


Thomas, D. J., Majewski, P. L., Ewing, C. L. & Gilbert,
N. S. Medical Qualification Procedures for
Hazardous-duty Aeromedical Research. (Conference
Proceedings No. 231, A3, pp. 1-13, 1978) London:
AGARD, 1977.

Williams, E. The effects of amount of information in
the Stroop color word test. Perception and
Psychophysics, 1977, 22, 463-470. (a)


Table 1

Black and White Word
Reliabilities Over 15 Days


Table 2

Color Block Reliabilities
Over 15 Days (n = 19)


Table 3

Color Word Reliabilities
Over 15 Days (n = 19)


Table 4

BW-CB Reliabilities
Over 15 Days (n = 19)


Table 5

CB-CW Reliabilities
Over 15 Days (n = 19)

.44 for p < .05
.54 for p < .01

Table 6

Performance Specification Criteria (Stabilization) for Performance Tests

                                        STANDARD                  STABILITY OF    OVERALL
TEST                    MEANS           DEVIATIONS                CORRELATIONS    STABILITY

Stroop BW               Asymptote       Level                     Yes             Yes
       CB               Asymptote       Level                     Yes             Yes
       CW               Asymptote       Level                     Yes/Marginal    Yes
       BW-CB            Asymptote       Level                     No              No
       CB-CW            Asymptote       Level                     No              No
Time Estimation         Plateau         Slow Increase or Level    No              No
Complex Counting        Plateau         Level                     No/Marginal     No
Critical Tracking       Slow Increase   Level                     Yes/Marginal    Yes
Arithmetic              Slow Increase   Slow Increase             Yes             Yes
Air Combat Maneuvering  Slow Increase   Slow Increase or Level    Yes             Yes

FIGURES

Figure 1. Stroop Test: Group means for black
and white words (BW), color blocks (CB), color words
(CW), BW-CB, and CB-CW over 15 days (n = 19).

Figure 2. Stroop Test: Standard deviations for
BW, CB, CW, BW-CB and CB-CW over 15 days (n = 19).

Figure 3. Stroop Test: BW reliabilities for
selected base days (1, 2, 4, 9 & 13) and those
following over 15 days (n = 19).

Figure 4. Stroop Test: CB reliabilities for
selected base days (1, 2, 4, 9 & 13) and those
following over 15 days (n = 19).

Figure 5. Stroop Test: CW reliabilities for
selected base days (1, 2, 4, 9 & 13) and those
following over 15 days (n = 19).

Figure 6. Stroop Test: BW-CB reliabilities for
selected base days (1, 2, 4, 9 & 13) and those
following over 15 days (n = 19).

Figure 7. Stroop Test: CB-CW reliabilities for
selected base days (1, 2, 4, 9 & 13) and those
following over 15 days (n = 19).

Figure 8. Time Estimation Test: Group means (X)
and standard deviations (SD) for constant error score
over 15 days (n = 19).

Figure 9. Time Estimation Test: Reliabilities
of constant error score for selected base days (1, 2,
4, 8, 10 & 12) and those following over 15 days (n = 19).

Figure 10. Complex Counting Test: Group means
(X) and standard deviations (SD) for percent correct
over 15 days (n = 19).

Figure 11. Complex Counting Test: Reliabilities
of percent correct for selected base days (1, 2, 4, 9
& 13) and those following over 15 days (n = 19).

Figure 12. Critical Tracking Test: Group means
(X) and standard deviations (SD) over 15 days (n = 18).

Figure 13. Critical Tracking Test: Reliabilities
for selected base days (1, 2, 4, 9 & 13) and
those following over 15 days (n = 18).

Figure 14. Arithmetic Test: Group means (X) and
standard deviations (SD) for total correct over 15
days (n = 18).

Figure 15. Arithmetic Test: Reliabilities of
total correct for selected base days (1, 2, 4, 8, 10 &
12) and those following over 15 days (n = 18).

Figure 16. Air Combat Maneuvering Test: Group
means and standard deviations for number of hits over
15 days (n = 13).

Figure 17. Air Combat Maneuvering Test: Reliabilities
of number of hits for selected base days (1,
2, 4, 6, 10 & 12) and those following over 15 days


PROCEEDINGS OF THE 24TH ANNUAL MEETING OF THE HUMAN FACTORS SOCIETY
LOS ANGELES, CA, 11-17 OCTOBER 1980


PERFORMANCE EVALUATION TESTS FOR ENVIRONMENTAL RESEARCH (PETER):
AUDITORY DIGIT SPAN TASK¹

Denise B. McCafferty²
University  of  West  Florida 

Alvah  C.  Bittner,  Jr.  and  Robert  C.  Carter 
Naval  Biodynamics  Laboratory 
New Orleans, LA 70189

ABSTRACT 


Auditory  digit  span  was  evaluated  as  an  instrument  for  repeated  measurements  experimentation. 
Twelve  subjects  were  tested  for  one  hour  on  each  of  12  consecutive  workdays  in  a  standard  environment. 
Both  forward  and  backward  digit  span  were  measured.  It  was  found  that  forward  digit  span  was  suitable 
for  repeated  measures  after  ten  days  of  practice  at  30  minutes  per  day.  The  criteria  for  suitability 
were  predictability  of  the  mean  scores,  constancy  of  the  standard  deviations  and  differential  stability 
of  the  intertrial  correlations.  These  criteria  are  sufficient  conditions  both  for  repeated  measures 
Analysis  of  Variance,  and  for  interpretation  of  experimental  effects.  Although  the  backward  digit 
span  scores  did  not  meet  these  criteria,  they  became  more  and  more  correlated  with  the  forward  digit 
span  scores  as  the  experiment  progressed.  This  indicates  that  the  mental  content  of  the  two  tests  of 
memory  converged  with  practice.  One  implication  of  this  finding  is  to  question  the  meaningfulness  of 
factor  structure  after  only  limited  practice.  The  forward  auditory  digit  span  test  was  recommended 
for  inclusion  in  a  battery  of  Performance  Evaluation  Tests  for  Environmental  Research  (PETER). 


INTRODUCTION 

Background 

Tests  of  human  cognitive  and  psychomotor 
ability  are  being  evaluated  for  inclusion  in  a 
battery  of  Performance  Evaluation  Tests  for 
Environmental  Research  (PETER).  PETER  is  a  human 
performance  task  battery  which  is  being  specifi¬ 
cally  designed  by  the  Naval  Biodynamics  Labora¬ 
tory  for  repeated  administration  in  unusual 
environments  (e.g.,  ship  motion,  vibration, 
hyperbaria, thermal extremes, drug administration)
(Kennedy & Bittner, 1977; Kennedy, Bittner,
& Harbeson, 1980). Candidate tests must meet at
least one of the following criteria: (1)
measure some aspect of information processing,
(2) be neurophysiologically diagnostic, or (3)
show sensitivity to unusual environments (Kennedy
& Bittner, 1977; Kennedy et al., 1980).

Before tasks are included, they must be
found suitably stable for simple analysis and
interpretation. Kennedy et al. (1980) and Jones
(1980) have suggested that stability exists when:
(a) mean performance reaches an asymptote or
evidences a slight constant slope, (b) day-to-day
variance is constant, and (c) relative performance
standings among subjects are constant from
day to day, as indicated by unchanging intertrial
correlations (differential stability). The first
of these stability criteria, for the means, was
indicated by Campbell and Stanley (1966) as
required for meaningful interpretation of experimental
results. The latter two, for variances
and correlations, were derived from the sufficient
(covariance) matrix condition for repeated
measures Analysis of Variance (Winer, 1971).
PETER requires all three of these stability
criteria.

1 This research was performed under Navy Work
Unit No. MF58.524.002-5027. The opinions are
the authors' and do not necessarily reflect
those of the Department of the Navy.

2 Now at the Essex Corporation, Alexandria, VA.

Purpose 

The  present  study  was  undertaken  to  determine 
whether  baseline  performance  on  Auditory  Digit 
Span  (ADS)  (Ekstrom,  French,  Harman,  and  Derman, 
1976;  Wechsler,  1958)  would  stabilize  following 
repeated  administration  of  both  ADS  forward  (DF) 
and  backward  (DB) . 

METHOD 

Subjects 

Subjects  were  9  healthy  Navy  enlisted  males 
(ages  18  to  25)  assigned  to  the  Naval  Biodynamics 
Laboratory,  New  Orleans,  as  full-time  volunteer 
research  subjects.  All  volunteer  research  sub¬ 
jects  were  recruited,  evaluated  and  employed  in 
accordance  with  procedures  specified  in  Secretary 
of  the  Navy  Instruction  3900.39  and  Bureau  of 
Medicine  and  Surgery  Instruction  3900.6.  These 
instructions  are  based  upon  voluntary  informed 
consent,  and  meet  provisions  of  prevailing  national 
and  international  guidelines.  Each  subject  was 
selected  for  his  mental  and  physical  ability  to 
withstand  possible  hazardous  environmental  research. 
Subjects  were,  however,  considered  representative 
of  the  Navy  enlisted  population  in  intelligence 
(cf. Thomas, Majewski, Ewing, & Gilbert, 1978).

Apparatus 

The Ekstrom et al. (1976) Auditory Number
Span Task, based on the seminal work of Kelly
(1954), was used as a model for the development
of 52 alternate forms, 28 DF and 24 DB. In
accordance with Ekstrom et al. (1976), each form
consisted of 24 separate series of digits. Each
series contained between 4 and 12 digits.


The 28 DF and 24 DB tests were randomly
assigned to the 12 days of presentation. The four
extra forms of DF were used during a two-day pilot
experiment which immediately preceded the 12 days
of the main experiment. In the following discussion,
the third day of exposure to DF testing
will be called Day 1 so that the results for DF
and DB can be described on a common time line.
Each day of testing consisted of 2 different forms
of DF and DB, so that the within-session reliability
of the tests could be assessed.


Readings of the lists were recorded on an
Ampex 600 reel-to-reel tape recorder using Ampex
641 magnetic tape. Wechsler's method of reading
one digit per second with a drop in voice inflection
on the last digit in a series was used (cf.
Hagen, Durham, & Shannon, 1977).

Procedure 

Subjects were tested in groups between 0745
and 0845 on 12 consecutive workdays. Prior to the
experiment, an orientation to the task was held which
involved an explanation of the task, instructions,
and task demonstrations.

Sessions consisted of four 15-minute sections,
two DF and two DB. Instructions were given prior
to each section. On the DF portion of the task,
subjects were instructed to listen to tape-recorded
numbers. Upon cue, they were to write those numbers
on their answer sheets in the exact order in
which they were presented. (A response time of 2
seconds per presented digit was allotted.) Following
Ekstrom et al. (1976), a subject's score
was the number of correctly recorded series.
Therefore, scores for DF or DB could range between
0 and 48 for the two forms composed of 24 series
each. Instructions for DB were the same, except
that subjects were instructed to write
their answers in the exact opposite order to that in
which they were called out.
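The scoring rule described above can be sketched in a few lines. This is only an illustration of the stated rule (one point per series reproduced exactly, in presented order for DF and in reverse order for DB); the function name `score_form` and the sample digits are hypothetical and do not come from the report.

```python
def score_form(presented, responses, backward=False):
    """Count series reproduced exactly (hypothetical helper, not from the report).

    presented: list of digit series as read to the subject.
    responses: list of digit series as written by the subject.
    backward:  False for DF (same order), True for DB (reverse order).
    """
    correct = 0
    for series, answer in zip(presented, responses):
        # For DB the target is the presented series in reverse order.
        target = list(reversed(series)) if backward else list(series)
        if list(answer) == target:
            correct += 1
    return correct


# Illustrative data only: two short series.
presented = [[3, 1, 4, 1], [5, 9, 2, 6, 5]]
df_answers = [[3, 1, 4, 1], [5, 9, 6, 2, 5]]   # second series has a transposition
db_answers = [[1, 4, 1, 3], [5, 6, 2, 9, 5]]   # both correctly reversed

df_score = score_form(presented, df_answers)
db_score = score_form(presented, db_answers, backward=True)
```

With two 24-series forms per day, summing `score_form` over both forms gives the 0 to 48 daily range stated in the text.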

RESULTS 

The data were analyzed in two phases. During
the first phase, the DF and DB tasks were checked
for stability. The structure of the forward and
backward portions of the test was compared during
the second phase.

Task Stability

Means and Standard Deviations. Figure 1
shows the average DF and DB scores over days. It
appears that DF means are larger than DB means and that
the difference is constant. Table 1 supports this
view with a significant difference between the
means for total forward and total backward and
with no interaction of DF versus DB and days. The
effect of days, it is noteworthy, also was significant,
reflecting a gain in performance over trials
which is approximately linear after Day 4 (for
nonlinear trends, F(7,21) = 2.17, p > .08). The
slope of the linear trend is 0.11 series per day.

Figure 1. Mean Total Correct Forward and Backward
Digit Span for 12 Days (N=9)

TABLE 1

ANALYSIS OF VARIANCE (ANOVA)

SOURCE                         DF         SS        F         p

DF vs DB INSTRUCTION (I)        1    1148.44    88.00    10^-10
DAYS (D)                       11     931.09     5.49    10^-8
I X D                          11     162.30     1.13    n.s.
SUBJECTS                        8    4177.29    40.02    10^-10
RESIDUAL                      184    2400.49
TOTAL                         215    8819.70

An Fmax test comparing the largest to the smallest
within-day variances on the DF and DB tasks found
no significant difference for the forward and backward
tasks, respectively (Fmax = 6.25 and Fmax =
5.92, p > .10).

Intertrial Correlations. Figures 2 and 3
demonstrate the pattern of the DF and DB correlations
between scores obtained on days near the
onset of testing and those obtained on later
days. These figures show correlations of scores
obtained on selected Base Days with scores obtained
1, 2, or more days later. This method is helpful
in demonstrating not only the reliability of the
task over time, but also, in the case of DF, differential
stability. If these plots were flat and
overlapping after some day in practice, the test
scores were considered to be stable after that day
(Bittner, 1979). This pattern is suggested by the
lines representing Base Day 9 and following days on
the "total forward" graph. Conversely, the downward
slope of the lines on the "total backward"
graph does not indicate stability. Lawley tests
(Morrison, 1967) supported the view of the graphical
analyses. In the case of backward ADS, the Lawley
test indicated significant instability of the intertrial
correlation matrix (chi-square = 8.07, df = 2,
p < .01) across even the last 3 days of the study,
after 9 days of practice. In contrast, stability
was found for forward ADS on the 4 days after Day 8
(chi-square = 8.377, df = 5, p > .10). If the two extra
days of practice for DF are considered, we can conclude
that DF stabilizes after 10 days of practice
for 1/2 hour per day. The intertrial correlation
matrices for DF and DB are presented in Tables 2
and 3.

Figure 2. Correlations Between Selected Base
Days and Following Days for Total Correct
Forward Digit Span for 12 Days (N=9)
[Horizontal axis: days after base day.]

Figure 3. Correlations Between Selected Base
Days and Following Days for Total Correct
Backward Digit Span for 12 Days (N=9)
[Horizontal axis: days after base day.]
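The base-day correlation traces described above can be computed from a subjects-by-days score matrix. The sketch below is an assumption about the mechanics, not the authors' procedure: the names `pearson` and `correlation_trace` and the toy data are hypothetical, and no significance test (such as the Lawley test) is attempted.

```python
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def correlation_trace(scores, base_day):
    """Correlate scores on one base day with every later day.

    scores[s][d] is the score of subject s on day index d (0-based).
    Returns one correlation per following day, in day order.
    """
    n_days = len(scores[0])
    base = [row[base_day] for row in scores]
    return [
        pearson(base, [row[d] for row in scores])
        for d in range(base_day + 1, n_days)
    ]


# Toy data: 3 subjects, 3 days, later days are linear functions of day 1.
toy_scores = [[1, 2, 3], [2, 4, 5], [3, 6, 7]]
trace_from_day_1 = correlation_trace(toy_scores, base_day=0)
```

Plotting one such trace per base day gives curves like those in Figures 2 and 3; flat, overlapping traces for later base days are the graphical sign of differential stability that the text describes.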


TABLE 2

AUDITORY DIGIT SPAN TASK: Inter-day Correlations
for Forward Task Over 12 Days (N=9)

DAYS    2     3     4     5     6     7     8     9    10    11    12

   1   58*   83    58    79    84    93    60    53    68    60    60
   2         70    40    49    84    67    47    69    72    76    48
   3               71    54    82    84    51    57    59    66    49
   4                     57    64    59    68    79    65    67    72
   5                           79    89    84    75    75    79    91
   6                                 89    72    77    73    79    71
   7                                       74    66    72    79    76
   8                                             80    75    78    93
   9                                                   83    90    89
  10                                                         87    80
  11                                                               88

* Decimal Points Omitted


TABLE 3

AUDITORY DIGIT SPAN TASK: Inter-day Correlations
for Backwards Task Over 12 Days (N=9)

DAYS    2     3     4     5     6     7     8     9    10    11    12

   1   68*   67    63    71    44    32    35    35    19   -04    21
   2         95    68    92    86    87    68    70    57    52    68
   3               64    94    89    86    64    57    55    39    54
   4                     67    48    60    88    79    73    53    59
   5                           83    84    63    67    48    48    63
   6                                 92    62    61    64    52    63
   7                                       76    78    72    72    77
   8                                             83    94    74    79
   9                                                   69    87    89
  10                                                         ?4    68
  11                                                               95

* Decimal Points Omitted
? First digit illegible in the source copy


Task  Structure 


The second portion of analysis was devoted
to the examination of the structure of the forward
and backward tasks by graphical analysis and
analysis of variance. Reliabilities for each
day's total scores were obtained for each task.
For each day, the square root of the product of
the two tasks' reliabilities was then plotted on a
graph to represent the maximum expected correlation
between DF and DB (see Figure 4). The maximum
theoretical correlations, it is noteworthy,
would be obtained when all of the reliable
variance on both DF and DB tasks measures a single
"factor" (Harman, 1975). The correlations of
digit span forward and backward were also plotted
on this graph. This was an attempt to show the
relationship between the tasks, given the maximum
possible correlation allowed by the reliabilities.
It is clear that the correlation between the ADS
tasks approaches the maximum possible as trials
progress. Hence, DF and DB converge with practice.
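The correlation ceiling used above is the square root of the product of the two tasks' reliabilities. A minimal sketch follows; the function name `correlation_ceiling` is hypothetical, and the two input values are used purely as illustrative reliabilities rather than as the per-day reliabilities the authors actually plotted.

```python
from math import sqrt


def correlation_ceiling(reliability_a, reliability_b):
    """Maximum correlation two measures can show given their reliabilities.

    Implements the rule stated in the text: sqrt(r_aa * r_bb).
    """
    return sqrt(reliability_a * reliability_b)


# Illustrative inputs only (assumed values, not the plotted daily reliabilities).
example_ceiling = correlation_ceiling(0.86, 0.76)

# An observed DF-DB correlation can then be expressed as a fraction of its ceiling.
observed_r = 0.60  # hypothetical observed correlation
fraction_of_ceiling = observed_r / example_ceiling
```

A fraction near 1.0 would correspond to the situation the text calls convergence, in which nearly all reliable variance in the two tasks is shared.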


Figure 4. Forward and Backward Digit Span
Correlations across Adjacent Days Compared with
Correlation Ceilings (N=9)
[Horizontal axis: adjacent days.]


In contrast to the convergence of content
illustrated by the correlation results, Figure 1
demonstrates the unchanging difference of difficulties
across days represented by the means of
forward and backward. As noted before, there was
no significant interaction of instruction (forward-
backward) and day. This indicates that the
effects of instruction and experience with ADS are
additive and independent. In summation, the forward
and backward ADS processes appear to become
more similar in content with practice but their
means remain different by a constant amount.
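The F ratios in Table 1 can be rechecked from the printed sums of squares and degrees of freedom. The sketch below assumes, as the printed values suggest but the text does not state, that every effect was tested against the single RESIDUAL mean square. Under that assumption the instruction, interaction, and subjects ratios reproduce the printed values, while the DAYS ratio comes out near 6.49 rather than the printed 5.49, which may reflect a misprint or a scanning error in this copy.

```python
# Sums of squares and degrees of freedom as printed in Table 1.
effects = {
    "instruction": (1, 1148.44),
    "days": (11, 931.09),
    "instruction_x_days": (11, 162.30),
    "subjects": (8, 4177.29),
}
residual_df, residual_ss = 184, 2400.49

# Assumption: a single pooled residual serves as the error term for every effect.
ms_residual = residual_ss / residual_df

f_ratios = {
    name: (ss / df) / ms_residual
    for name, (df, ss) in effects.items()
}

for name, f_value in f_ratios.items():
    print(f"{name:>20s}  F = {f_value:6.2f}")
```

The non-significant interaction (F close to 1) is what supports the statement that the instruction and practice effects are additive.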

DISCUSSION 

Task  Suitability  for  Performance  Tests 

The forward portion of the auditory digit
span task (DF) was found to be suitable for the
PETER battery. In particular, the change of the DF
mean performance was found to be approximately
linear after Day 4 and the variances evidenced no
significant change over the course of the study.
In addition, the DF task was found to be differentially
stable for the last four days of testing.
The reliability of the DF task, it is pertinent to
note, was comparable to that (r = .74) reported by
Ekstrom et al. (1976), with r = 0.86 over the
differentially stable days. The DF task meets the
suitability criteria for means, variances, and
correlations required for simple analysis and
interpretation (Kennedy & Bittner, 1977; Kennedy,
et al., 1980).
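The reliability of 0.86 quoted for DF is consistent with a simple average of the six Table 2 correlations among Days 9 through 12. That the authors computed it this way is an assumption; the sketch only shows that the arithmetic agrees.

```python
# Table 2 correlations among the differentially stable days (Days 9-12),
# with decimal points restored. Keys are (earlier day, later day).
df_stable_correlations = {
    (9, 10): 0.83,
    (9, 11): 0.90,
    (9, 12): 0.89,
    (10, 11): 0.87,
    (10, 12): 0.80,
    (11, 12): 0.88,
}

# Assumption: reliability over the stable days is the unweighted mean.
mean_stable_r = sum(df_stable_correlations.values()) / len(df_stable_correlations)
print(f"Mean DF inter-day correlation, Days 9-12: {mean_stable_r:.2f}")
```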

In contrast to DF, the backward task (DB)
failed to stabilize suitably for consideration
for inclusion in PETER. While the analysis of
means showed a linear increase after Day 4 and
constant variances, the task did not evidence
differential stability even after 9 days of practice.
The average reliability of DB for the last
three days was moderately high with r = 0.76, suggesting
ultimate reliability in the neighborhood of
that seen for the DF task. However, convergence of
the DB on the DF task, as seen in Figure 4, suggests
that the DB task would eventually become
stable as it continued to approach the DF task.
This approach to differential stability is also
suggested by the slopes of the traces seen in the
graphical analysis of Figure 3. The slopes of the
traces appear to be approaching a zero slope as
base days become later. The ultimate convergence
of the DB task to differential stability is an empirical
question which requires more investigation.
The task is unlikely, however, to be of interest
for a task battery, as DB appears to be approaching
DF, which is already stable.

Implications  for  Performance  Testing 

The implications of the results for performance
testing are twofold. First, the stability
of the mean and standard deviations after the
fourth day would have misled investigators who use
only these statistics for determining the suitability
of a task before beginning an environmental
investigation. The changing character of
what-is-being-measured (Alvares & Hulin, 1972), as indicated
by the intertrial correlations, would not
have been apparent to such investigators and the
meaningfulness of their results would have been
unknowingly compromised (Bittner, 1979). This
would be particularly true when the magnitude of
change of the intertrial correlations is as large
as the changes reported by other investigators
(Kennedy, et al., 1980). The second implication
of this investigation's results is to question the
meaningfulness of task differences seen with only
one or two trials of practice. In early stages of
training, DF and DB tasks measured non-overlapping
variance. However, with more training, the overlap
was seen to increase. How true this would be
for other tasks currently believed to measure distinct
abilities is an empirical question. Current
factor batteries (e.g., Ekstrom, et al., 1976)
have been developed based on performances with no
or only one trial of practice. The possibility is
suggested that, with repeated testing, the plethora
of human performance factors or abilities may
converge to far fewer than presently thought.
Both implications for performance testing revolve
around the issue of changes in the character of
a task with practice. The issue deserves greater
attention and investigation.

Conclusion

The forward portion of the auditory digit
span task is suitable for use in environmental
research employing repeated measures. Auditory
Digit Span is recommended for inclusion in a test
battery as a measure of inattention or freedom
from distraction and as an indicator of short-term
memory or neurophysiological impairment.

REFERENCES 

Alvares, K. M., & Hulin, C. L. Two explanations
of temporal changes in ability-skill relationships:
A literature review and theoretical
analysis. Human Factors, 1972, 14,
245-308.

Bittner, Jr., A. C. Statistical tests for differential
stability. Proceedings of the
23rd Annual Meeting of the Human Factors
Society. Boston, October 1979.

Campbell, D. T., & Stanley, J. C. Experimental
and quasi-experimental designs for research.
Chicago: Rand McNally, 1966.

Ekstrom, R. B., French, J. W., Harman, H. H., &
Dermen, D. Manual for kit of factor-referenced
cognitive tests. Princeton, N.J.:
Educational Testing Service, 1976.

Hagen, R. L., Durham, T., & Shannon, D. Administration
of digit span on Wechsler and
Binet: Differences. Journal of Clinical
Psychology, 1977, 34, 480-432.

Harman, H. H. Modern factor analysis. Chicago:
University of Chicago Press, 1975.

Jones, M. B. Stabilization and task definition
in a performance test battery (NBDL Monograph
No. M-0001). New Orleans, LA: Naval
Biodynamics Laboratory, 1980.

Kelley, H. P. A factor analysis of memory ability
(Ph.D. thesis, Princeton University, 1954).
Educational Testing Service Research Bulletin,
7, 1954.

Kennedy, R. S., & Bittner, Jr., A. C. The development
of a performance evaluation test for
environmental research (PETER). In L. T.
Pope & D. Meister (Eds.), Productivity
Enhancement: Personnel Performance Assessment
in Navy Systems. Symposium presented at the
Naval Research and Development Center, San
Diego, October 1977, 393-408.

Kennedy, R. S., Bittner, Jr., A. C., & Harbeson,
M. M. An engineering approach to the standardization
of performance evaluation tests
for environmental research (PETER).
Proceedings of the 11th Annual Conference
of the Environmental Design Research
Association, Charleston, SC, 2-6 March,
1980.

Morrison, D. F. Multivariate statistical methods.
New York: McGraw-Hill, 1967.

Thomas, D. J., Majewski, P. L., Ewing, C. L., &
Gilbert, N. S. Medical qualification procedures
for hazardous duty. Aeromedical
Research Conference Proceedings (No. 231
A3). London: AGARD, 1978, 1-13.

Wechsler, D. Measurement and appraisal of adult
intelligence (4th ed.). Baltimore: Williams
& Wilkins, 1958.

Winer, B. J. Statistical principles in experimental
design. New York: McGraw-Hill,
1971.




PROCEEDINGS  OF  THE  24TH  ANNUAL  MEETING  OF  THE  HUMAN  FACTORS  SOCIETY 
LOS  ANGELES,  CA,  13-17  OCTOBER  1980 


COMPARISON  OF  MEMORY  TESTS  FOR  ENVIRONMENTAL  RESEARCH 

Mary  M.  Harbeson,  Michele  Krause,  and  Robert  S.  Kennedy 
Naval  Biodynamics  Laboratory,  New  Orleans,  LA  70189 

ABSTRACT 

Four memory tests were considered for inclusion in a human performance test battery. The tests
were administered to 23 Navy enlisted men for 15 consecutive days. Group means, standard deviations,
and cross-session correlations were examined. Two of the tests, Interference Susceptibility and Free
Recall, met the initial statistical criteria for inclusion in the test battery. However, the other
tests, Running Recognition and List Differentiation, failed to show sufficient task definition and
reliability in their present form. These tests are compared with each other and with previous memory
research studies.

INTRODUCTION 

Memory  functions  are  among  the  complex  mental 
operations  which  are  involved  in  Navy  jobs  and 
play  a  role  in  the  effectiveness  of  Navy  systems. 
This  report  focuses  on  an  evaluation  of  four 
memory  tests  which  were  considered  for  the  Perfor¬ 
mance  Evaluation  Tests  for  Environmental  Research 
battery.  Comparisons  are  made  between  the  present 
study  and  research  by  Underwood,  Boruch,  and  Malmi 
(1977)  and  Fernandes  and  Rose  (1978)  in  which  the 
same  tests  were  examined  for  different  purposes. 
Present  efforts  are  devoted  to  the  development  of 
a  test  battery  which  will  be  used  to  determine  the 
extent  of  performance  decrements  in  stressful 
environments  (Harbeson,  Kennedy,  &  Bittner,  1979; 
Kennedy  &  Bittner,  1977,  1978;  Kennedy,  Bittner,  & 
Harbeson,  1980;  Kennedy,  Carter,  &  Bittner,  1980). 
Cognitive,  perceptual,  and  psychomotor  tests  which 
were  previously  shown  to  be  sensitive  to  several 
validity  criteria  have  been  selected  for  study 
(Carter,  Kennedy,  &  Bittner,  1980;  Kennedy,  Carter, 

&  Bittner,  1980).  Tests  meeting  these  initial 
criteria  are  administered  and  evaluated  to  deter¬ 
mine  whether  they  are  stable  and  reliable  after 
extended  practice.  Future  research  will  employ 
real  world  work  criteria  from  task  analyses  (Shannon, 
1980a)  in  order  to  select  and  validate  subsequent 
tasks. 

The strategy of this research program has
been to administer each task for 15 consecutive
workdays to the same group of subjects. Tasks are
considered stable if after practice: (a) the
means are level or evidence a slight, constant
slope over days, (b) the standard deviations are
level, and (c) the between-trial correlations
cease to change over trials. In addition, cross-session
reliabilities (task definition) must be
high enough to differentiate among individuals. A
correlation of .707 has been set as the lower
limit for acceptability. Tests which are both
stable and have adequate task definition are
selected for tentative inclusion in the test
battery.
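The .707 lower limit can be read as a shared-variance requirement: a correlation of .707 between two sessions corresponds to roughly half of the variance in common. The report does not state this rationale, so it is offered here only as an arithmetic observation. The helper names below are hypothetical.

```python
TASK_DEFINITION_LOWER_LIMIT = 0.707  # lower limit stated in the text


def shared_variance(r):
    """Proportion of variance two sessions have in common (r squared)."""
    return r * r


def has_adequate_task_definition(r, lower_limit=TASK_DEFINITION_LOWER_LIMIT):
    """Apply the cross-session reliability criterion to a single correlation."""
    return r >= lower_limit


# Values quoted elsewhere in these papers, used only as examples.
for label, r in [("DF, Days 9-12", 0.86), ("List Differentiation, Days 9-14", 0.64)]:
    verdict = "meets" if has_adequate_task_definition(r) else "falls below"
    print(f"{label}: r = {r:.2f}, shared variance = {shared_variance(r):.2f}, {verdict} the limit")
```

The stability criteria (a) to (c) are separate checks on means, standard deviations, and correlation traces; this sketch covers only the task-definition threshold.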

The  four  memory  tests  in  this  study  were 
adapted  from  Fernandes  and  Rose  and  based  on  the 
earlier  work  of  Underwood,  Boruch,  and  Malmi. 

Each  test  was  designed  to  measure  a  different 
aspect  of  memory.  Free  Recall  was  designed  to 
measure  recall  or  retrieval  skill.  Running  Recog¬ 
nition  dealt  with  recognition  or  the  ability  to 
discriminate  between  memories.  List  Differentia¬ 


tion  was  used  as  a  measure  of  temporal  discrimina¬ 
tion,  and  Interference  Susceptibility  was  designed 
to  study  the  effects  of  proactive  interference. 

These  tasks  were  selected  as  representative  of  a 
larger  body  of  tasks  studied  by  Underwood,  et  al . 

The authors examined the interrelationships
among 28 episodic and 5 semantic memory tasks in
order to determine the correlations among various
attributes of memory (associative, temporal,
acoustic, etc.). Each task was administered
once, to 200 college students. A factor analysis
revealed 5 factors: (a) paired-associate/serial,
(b) free recall, (c) memory span, (d) recognition/
frequency discrimination, and (e) verbal discrimination.
These factors were related to the tasks
rather than to the attributes.

Fernandes  and  Rose  selected  6  tests  from  the 
Underwood,  et  al .  study.  These  authors  were 
interested  in  an  information-processing  approach 
to  the  problems  of  both  individual  differences 
and  memory  function.  Their  objective  was  an 
assessment  instrument  that  could  be  generalized 
to  a  wide  range  of  criterion  tasks.  Each  test  was 
administered  twice,  to  22  office  workers.  Fernandes 
and  Rose  employed  the  Underwood  stimulus  material 
for  their  first  session,  and  generated  equivalent 
alternate  forms  for  the  second  session.  The 
results  of  their  study  led  Fernandes  and  Rose  to 
propose  5  of  the  6  tests  as  candidates  for  their 
performance  battery,  omitting  Interference  Sus¬ 
ceptibility  because  of  extreme  variations  in 
group  performance.  They  further  commented  that 
the  memory  tests  appeared  more  related  to  general 
skill  in  encoding  and  storage  than  to  the  attri¬ 
butes  they  were  nominally  purporting  to  measure. 

In the present study four of the six tests
used by Fernandes and Rose were administered for
15 consecutive working days. Situational Frequency
was excluded because it did not lend
itself to easy construction of alternate forms.
Because of time constraints, Digit Span, a task
similar to the Memory Span Test suggested by
Fernandes and Rose, was administered to a different
population and is reported separately (McCafferty,
Bittner, & Carter, 1980).

Purpose 

The purpose of this study was to determine
the effects of extended practice on four memory
tests and to determine their suitability for
inclusion in a human performance test battery.

METHOD 

Task Descriptions

Running Recognition. Subjects were shown a
long list of words and were asked to indicate
whether each item was old or new by circling the
appropriate response on their answer sheets. An
example of the stimulus presentation is shown in
Table 1. This test was based on a test developed
by Shepard and Teghtsoonian (1961), who used
numbers rather than words. The Underwood group
designed their test to measure recognition sensitivity
and an acoustic attribute. The test included
words of different acoustic characteristics,
which were repeated at different lags within a
list. Two lists were used, one containing 173
words, and the other 174. Each word was displayed
for 4 seconds. Fernandes and Rose used the Underwood,
et al. stimuli to construct a list of 101
words for each of their two testing sessions. All
words, except one, appeared twice in a list, and
lags between the words varied from 1 to 36 words.
Each word was displayed for 3 seconds.


List Differentiation. Three distinct lists
of four-letter words were presented. The same
words were arranged in random order on the response
sheets, followed by the digits 1, 2, and 3. The
subjects were required to indicate the list to
which each item belonged (see Table 2). Underwood,
et al. administered 2 sets of 3 lists each, with
20 words per list in 1 session. The response time
was unpaced. Fernandes and Rose and the present
study followed this procedure except that the
response time was set at 3 minutes, and in the
present study only 1 set of 3 lists per day was
used.

Free  Recall.  The  Fernandes  and  Rose  stimulus 
material  was  used  and  additional  alternate  forms 
were  generated  following  the  Underwood,  et  al . 
method.  Subjects  were  shown  lists  of  common  words 
and  were  instructed  to  write  as  many  as  they  could 
remember  on  their  answer  sheets.  Three  conditions 
were  used:  control,  concrete,  and  abstract.  An 
example  of  the  stimulus  material  is  shown  in  Table 
3. The control condition, which was described by
Fernandes  and  Rose  as  a  measure  of  short-term 
memory,  consisted  of  five-letter  words  selected  at 
random  from  the  Thorndike-Lorge  (1944)  tables. 


TABLE 1

Running Recognition: Example of Stimulus
Presentation

STIMULUS      RESPONSE SHEET

INCOME        (NEW)    OLD
BUILD         (NEW)    OLD
INCOME         NEW    (OLD)
CHATTER       (NEW)    OLD

Circled responses are shown in parentheses.

In the present experiment, the Fernandes and
Rose procedure was followed but the lists for each
day were reduced to 51 words. Alternate forms
were generated by selecting words in a pseudo-random
manner from the pool of 101 original stimulus
words. There were 5 unique orders of presentation.

TABLE 2

List Differentiation: Example of Stimulus
Presentation

STIMULUS MATERIAL            RESPONSE SHEET

LIST 1   LIST 2   LIST 3

prow     swab     soon       need    1   (2)   3
cost     meet     area       thaw    1    2   (3)
miss     adds     thaw       cost   (1)   2    3
foil     that     atop       area    1    2   (3)

Circled responses are shown in parentheses.

TABLE 3

Free Recall: Example of Stimulus Material

CONTROL     CONCRETE      ABSTRACT

sugar       body          trouble
yield       circle        hour
horse       gentleman     method
quote       arrow         affection

The concrete and abstract conditions were designed
to measure encoding by imagery. Words with values
above 6 on the Paivio, Yuille, and Madigan (1968)
rating scale were used in the concrete condition,
and those with values below 3 were used in the
abstract condition. Underwood, et al. used 4
lists for the control condition and 2 lists each
for concrete and abstract conditions, with 24
words per list. Subjects were shown each word for
4 seconds, with 2 minutes allowed for recall at
the end of each list. Fernandes and Rose followed
the same procedure as Underwood, et al. except
that lists of 20 words were used, and the presentation
time and response time were reduced by 50%.
In the present study, only 2 lists of control, and
1 list each of concrete and abstract words were
used each day. Approximately 30% of the words were
used twice, with the contingency that the same
word was not repeated on adjacent days. Testing
time occupied 7 minutes per session.

Interference Susceptibility. Stimulus material
for each session was comprised of paired-associate
lists. A list was made up of 5 three-letter words
paired with the digits 1-5. Table 4 gives an
example of the stimulus presentation. Each set consisted
of 4 lists, in which the same words and
digits were used, but paired differently and presented
in a different order in each list. Five new
words were paired with the digits 1-5 in each set.
After each paired-associate list was presented, the
words alone were shown in random order and the subjects
were required to write the appropriate digit
on their response sheets. Six sets of stimuli were
presented in both the Underwood, et al. and
Fernandes and Rose studies. Inspection and response
times for each item were 3 seconds. In the present
study, only 3 sets per session were presented.


changed slide carousels and cassette tapes. Testing
lasted approximately 40 minutes per day.

RESULTS  AND  DISCUSSION 
Running Recognition

An  overall  percent  correct  score  was  calcu¬ 
lated.  Figure  1  shows  means  and  standard  devia¬ 
tions.  Group  means  begin  at  95%  and  decrease 
slightly  over  days,  ranging  between  95%  and  89%. 
The  average  standard  deviation  is  5.87%,  and 
although  variable,  did  not  show  any  positive  or 
negative  trend. 


TABLE 4

Example of Interference Susceptibility

INSPECTION LIST     TEST LIST     CORRECT RESPONSE


Subjects

The  subjects  were  a  group  of  23  volunteer  en¬ 
listed  Navy  men,  ages  19  to  24.  To  qualify  for 
this  medical  research  program,  they  had  to  be 
average  or  above  the  norms  for  Navy  enlisted 
personnel  in  physical  health,  mental  health  and 
intelligence.  All  subjects  were  recruited,  eval¬ 
uated  and  employed  in  accordance  with  procedures 
specified  in  Secretary  of  the  Navy  Instruction 
3900.39  and  Bureau  of  Medicine  and  Surgery  Instruc¬ 
tion  3900.6.  These  instructions  are  based  upon 
voluntary  consent,  and  meet  the  provisions  of 
prevailing  national  and  international  guidelines. 

A description of the subject selection procedure
is given by Thomas, Majewski, Ewing, and Gilbert
(1978).


Figure 1. Running Recognition means and standard
deviations for percent correct across 15 days
(n = 23).


The graph of the cross-session correlations
which is shown in Figure 2 was constructed by
plotting the correlations between a base day and
each subsequent day (e.g., Day 1 to 2, 1 to 3, ...
1 to 15). Correlations are extremely variable, but
there is no obvious trend. Because task definition
is very low (r < .20), this test does not meet the
minimum criteria for inclusion in the performance
test battery. It is believed that in the present



Apparatus 

The stimulus material consisted of 2 x 2 inch
black and white slides with one item per slide
presented on a Kodak Ektagraph 450 Audio Viewer.
The rate of presentation was controlled by preprogrammed
tape cassettes. Subjects recorded their
answers on response sheets.




Procedure 

The  subjects  were  tested  in  groups  of  four 
beginning  at  3:00  AM  for  15  consecutive  workdays. 
The  four  tests  were  administered  in  the  same  order 
to  each  group  of  subjects,  but  the  order  varied 
for  different  groups.  There  was  a  break  of  2  or  3 
minutes  between  tests  while  the  experimenter 


Figure 2. Running Recognition correlation traces
for percent correct across 15 days (n = 23).
[Horizontal axis: days.]


study, shortening the test made it too easy. This
may have caused a ceiling effect, which lowered
between-subject variance and, therefore, reliability.
A Spearman adjustment for test length indicated that
making our test comparable in length would raise the
correlations to what Underwood obtained. However,
a 23 minute memory test would be prohibitively
time consuming as part of a battery. Possibly, a
selection of different stimulus material (e.g.,
nonsense syllables or abbreviations) would provide
the required reliability (sensitivity) with more
modest testing time.
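The "Spearman adjustment for test length" is presumably the Spearman-Brown prophecy formula, which predicts the reliability of a test lengthened by a factor k from the reliability r of the original: r_k = k r / (1 + (k - 1) r). The sketch below assumes that reading. The lengthening factor is taken from the list lengths given in the task description (51 words here, 173 in the longer original list), and the starting reliability of .20 is only the upper bound quoted in the text, so the result is illustrative rather than a reproduction of the authors' calculation.

```python
def spearman_brown(r, k):
    """Predicted reliability when a test is lengthened by a factor k.

    r: reliability of the current test.
    k: ratio of new length to current length (k > 1 lengthens the test).
    """
    return k * r / (1 + (k - 1) * r)


# Assumed inputs: r = .20 is the upper bound quoted for Running Recognition,
# and k compares a 173-word list with the 51-word list used here.
current_r = 0.20
length_factor = 173 / 51
projected_r = spearman_brown(current_r, length_factor)
print(f"Projected reliability at {length_factor:.2f}x length: {projected_r:.2f}")
```

Doubling a test with reliability .50, for example, gives `spearman_brown(0.5, 2)`, which is 2/3.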

List Differentiation

A percent correct score for each of the three
lists was calculated. Means and standard deviations
for the three lists were comparable. The
most reliable score, however, proved to be percent
correct across all lists, and this was used in
subsequent analyses. Means and standard deviations
appear level across sessions (Figure 3).
Analysis of variance and Fmax tests were
nonsignificant.


Figure 3. List Differentiation means and standard
deviations for percent correct across 15 days
(n = 23). [Horizontal axis: days.]


Correlation traces (Figure 4) are generally
low (r = .37), but improve somewhat with later
days. Early traces tend to decline as performance
becomes more remote from the base day, reflecting
instability. However, with the exception of the
final day, which was extremely low (c.f. Shannon,
1980b), there was a tendency for later base days to
have higher correlations than earlier days (for
Days 9 - 14, r = .64). Therefore, this task stabilizes
when Day 15 is dropped. In the present
study, the shorter testing time (50% of Fernandes
and Rose's) and task difficulty may have contributed
to the lower correlations (note that the average
percent correct score across 15 days was 45.25%).
This task is not suitable in its present form, but
with modifications (e.g., stimulus material with
more meaningful associations), it could be made
acceptable for the performance test battery.

Free  Recall 


Percent correct scores were calculated for
the control, concrete and abstract conditions.
The means and standard deviations for all
conditions followed a similar pattern and, as expected,
performance was generally best for concrete words
and poorest for abstract words. The average score
across all conditions was used in the analyses
because it was the most reliable and was highly
correlated each day with all other scores. The
means and standard deviations are shown in Figure
5. With the exception of the first and last days,
the means appear level with a gradual increase
across sessions. The average percent correct
score across days is 35%. A significant days
effect is shown in the analysis of variance,
F(14, 308) = 2.54. Examination of the
orthogonal components revealed a significant
quartic (4th order) effect. First and last day,
and weekend effects may offer an explanation.

The  standard  deviations  appear  level  across 
days  with  a  slight  increase  proportional  to  the 
means.  An  Fmax  test  showed  no  statistically 
significant  difference  across  days. 
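
The Fmax statistic used here is Hartley's ratio of the largest to the smallest daily variance. A minimal sketch follows, assuming a hypothetical `scores` table with one list of subject scores per day. The numbers are illustrative only and are not data from this study, and the resulting ratio would still have to be compared against a published Fmax critical value.

```python
from statistics import variance

def f_max(scores_by_day):
    """Hartley's Fmax: largest sample variance divided by the smallest."""
    variances = [variance(day) for day in scores_by_day]
    return max(variances) / min(variances)

# Hypothetical percent-correct scores (three days, five subjects each).
scores = [
    [30, 35, 40, 45, 50],   # sample variance 62.5
    [32, 36, 40, 44, 48],   # sample variance 40.0
    [25, 35, 40, 45, 55],   # sample variance 125.0
]
ratio = f_max(scores)       # 125.0 / 40.0 = 3.125
print(ratio)
```

A ratio near 1 is consistent with level standard deviations across days; a large ratio would suggest that variance is still changing with practice.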


Figure 4. List Differentiation correlation traces
for percent correct across 15 days (n = 23).

Figure 5. Free Recall means and standard
deviations for percent correct across 15 days
(n = 23).


The correlations for selected base days and
those subsequent appear in Figure 6 and reflect no
dramatic trend, although there is a tendency for
later day correlations to be higher than those for
earlier days. It appears that the correlations
may be stable as early as Day 1. Task definition,
when averaged across 15 days, is r = .63 but
reaches r = .72 when only the days after Day 9 are
considered. This task is acceptable for inclusion
in the human performance test battery.


Figure 6. Free Recall correlation traces for
percent correct across 15 days (n = 23), plotted
against days after base performance.

Interference Susceptibility

Percent correct scores within each list,
across lists, and within and between sets were
calculated. In addition, slope scores were
calculated across lists. A composite mean score was
used, again, because it was the most reliable and
because daily part scores correlated highly
(generally, r > .60) with each other and with the
total score. The slope scores, the traditional
interference measure, possessed zero reliability. Figure
7 shows the means and standard deviations. Except
for the extremely low score on Day 6, the means
show a smooth learning curve which asymptotes after
Day 7. The grand mean percent correct is 65%,
increasing from a low of 50% on Day 1 to a high of
74% on Day 13. Analysis of variance shows a
significant days effect for Days 1-15, F(14, 308) =
7.40, p < .01, and also for Days 7-15, F(8, 176) =
3.13, p < .01. This could be explained by the
continued and regular increase in performance across
sessions. The standard deviations appeared level
throughout testing. A nonsignificant Fmax
confirmed this observation.

Figure 7. Interference Susceptibility means and
standard deviations for percent correct across 15
days (n = 23).

The correlation traces (Figure 8) appear to
follow a pattern which is to be expected when
performances improve with practice. Like the
means, Day 6 correlations are anomalous and, while
the cause is unclear, most probably reflect
procedural or apparatus problems. With this exception,
the traces appear to be fairly level for each day
with the days which follow and increase in value
for subsequent base days. The figure has a
layered appearance with traces for later days being
approximately parallel, and higher than those for
earlier days. For the days after Day 7, the
traces appear to overlap, indicating stability.
The average correlation for Days 7-15 is .73, as
opposed to .46 overall. This test appears
acceptable for use in a human performance test battery.
It should be noted, however, that since the measure
of interference (slope) had a zero reliability, the
specific memory attribute being measured by this
test is in question.


Figure 8. Interference Susceptibility correlation
traces for percent correct across 15 days (n = 23).

Comparison of Tests

A  comparison  of  results  from  the  present 
study  and  past  research  on  these  tests  is  shown  in 
Table  5.  Data  from  the  past  studies  shown  in  this 
table  were  approximated  from  the  published  results. 
In  cases  where  no  reliabilities  were  given  for  a 
total  score,  the  reliabilities  for  each  condition 
were  averaged. 

For the most part, means and standard
deviations in the present study are comparable but tend
to be lower than those previously obtained.
Running Recognition (RR) has significantly lower
correlations. Correlations for List Differentiation
(LD) are low when only Days 1 and 2 are examined.
However, when days after stability are considered,
correlations approach those in past studies.
Interference Susceptibility (IS) in the present
study reveals higher means for stable days than
those obtained by Fernandes and Rose but lower than
those obtained by Underwood, et al. In the case
of Free Recall (FR), means are substantially lower
than those in the Underwood, et al. study, but are
essentially the same as those in the Fernandes and
Rose study. Different presentation times, 1 seconds
in the Underwood, et al. study and 2 seconds in the
other two studies, may account for the discrepancy.
In general, the differences between this study and
past research may be attributed to (a) decreased
test length, (b) modifications in the testing
procedure, (c) repetition of stimulus material, and
(d) subject population differences. The sample
used in the present study is representative of the
Navy enlisted population. In addition, they are
comparable to the general population on at least
one measure, the Wonderlic Personnel Test. Even
so, it is expected that the college student
population in the Underwood, et al. study may be
brighter and would be more practiced at tests
involving verbal ability. The lower reliabilities
that we obtained are probably the consequence of
attempting to shorten the tests so that they could
all be accomplished within a daily session lasting
approximately 30 minutes. It is our opinion that
the selection of more relevant (e.g., job related)
but more difficult (e.g., abbreviations/acronyms)
material may permit shorter tests at no sacrifice
to reliability. This will be attempted in a future
study.

TABLE 5

Comparison of Three Studies

               Underwood   Fernandes   Present Study
               et al.      & Rose      (n = 23)
               (n > 200)   (n = 22)

Sessions       1           1&2         1&2     (Stable Days)

RR                                             (1-15)
  Test Time*   23          5           2H
  X (%)        93          93          94      91
  r (x 100)    70          82          30      18

LD                                             (10-14)
  Test Time    7           7           4
  X (%)        55          50          46      46
  r (x 100)    71          77          42      64

FR**                                           (9-15)
  Test Time    29          13          7
  X (%)        53          38          34      35
  r (x 100)    67          77          68      72

IS                                             (7-15)
  Test Time    12          13          6
  X (%)        85          65          54      70
  r (x 100)    81          77          60      73

In Table 6, the correlations which appear in
the diagonal are the composite of stabilized days
within a test. Similarly, the between test
correlations which appear in the other cells are also
only for stabilized days. Thus, reliability
correlations for List Differentiation are the arithmetic
average of 10 comparisons (Days 10-14) and for Free
Recall, 21 comparisons (Days 9-15). Moreover, the
composite correlations between these two tests are
the average of 35 comparisons (i.e., Days 10-14
versus 9-15).
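
The comparison counts quoted above follow from simple counting: every pair of distinct stable days within a test, and every day-by-day pairing between two tests. The sketch below only reproduces that arithmetic; the variable names are illustrative and it does not compute any correlations.

```python
from itertools import combinations, product

ld_days = range(10, 15)   # List Differentiation stable days 10-14
fr_days = range(9, 16)    # Free Recall stable days 9-15

ld_pairs = list(combinations(ld_days, 2))       # within-test day pairs
fr_pairs = list(combinations(fr_days, 2))       # within-test day pairs
cross_pairs = list(product(ld_days, fr_days))   # between-test day pairs

print(len(ld_pairs), len(fr_pairs), len(cross_pairs))  # 10 21 35
```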

TABLE 6

Intercorrelation of Stable Periods of Four Memory Tests

        IS     FR     LD     RR

IS     .73    .50    .32    .25

FR            .72    .51    .17

LD                   .64    .21

RR                          .18

N.B. Caution should be taken in interpreting
results from tests of different lengths.

*  Minutes

** Underwood, et al. used lists of 24 words,
whereas the other studies used 20 words per list.


An inspection of this table reveals
correlations between stabilized trials that are higher
than the factor analysis of Underwood, et al.
would predict, since the tests were originally
selected for orthogonality. Indeed, given the
average low retest reliability of Running
Recognition, the present matrix implies only a single
factor for all four tests. When calculations were
performed over earlier (unstabilized) trials the
data were more in line with the low correlations
between tests found by Underwood, et al. However,
when Days 7-14 of three of the tests (List
Differentiation, Free Recall, and Interference
Susceptibility) were factor analyzed by Shannon (1980a) 63
percent of the common variance was explained by
one factor. These data suggest that following
extended practice on a family of tests, a general
factor which underlies all the tests may appear.
We have had this experience previously in our
laboratory (McCafferty, et al., 1980; Kennedy,
Bittner, & Jones, 1980). The practical consequence
of outcomes like this is that samples of
practiced behavior may have far broader generalizability
than was previously thought.

CONCLUSIONS 

In conclusion, of the four tasks considered
for inclusion in a human performance test battery,
Interference Susceptibility and Free Recall were
found to be acceptable. List Differentiation and
Running Recognition were not acceptable in their
present forms but could possibly be useful if
modified. The performance on the four tasks was
generally comparable to, but poorer than, that
obtained in the previous studies. In addition, it
is suggested that with extended practice all four
tasks may measure a single factor.



REFERENCES

Carter, R. C., Kennedy, R. S., & Bittner, A. C.,
Jr. Selection of Performance Evaluation
Tests for Environmental Research. Proceedings
of the 24th Annual Meeting of the Human
Factors Society, Los Angeles, October 1980.

Fernandes, K. & Rose, A. M. An Information
Processing Approach to Performance Assessment:
II. An Investigation of Encoding and Retrieval
Processes in Memory. (Tech. Report IAR 58500-
11/78 TR). Washington, D.C.: American
Institutes for Research, November, 1978.

Harbeson, M. M., Kennedy, R. S., & Bittner, A. C.,
Jr. A comparison of the Stroop Test to other
tests for studies of environmental stress.
Proceedings of the 12th Annual Meeting of the
Human Factors Association of Canada,
Bracebridge, Ontario, Canada, September, 1979,
21.1-21.9.

Kennedy, R. S., & Bittner, A. C., Jr. The
development of a Navy Performance Evaluation Test
for Environmental Research (PETER). In
Pope, L. T. & D. Meister (Eds.), Productivity
Enhancement: Personnel Performance Assessment
in Navy Systems. Symposium presented at the
Naval Personnel Research and Development
Center, San Diego, CA, October 1977, 393-408.
(NTIS No. AD 056047).

Kennedy, R. S., & Bittner, A. C., Jr. Progress in
the analysis of Performance Evaluation Tests
for Environmental Research (PETER).
Proceedings of the 22nd Annual Meeting of the
Human Factors Society, Detroit, Michigan,
October, 1978. (NTIS No. AD A060676).

Kennedy, R. S., Carter, R. C., & Bittner, A. C.,
Jr. A catalogue of Performance Evaluation
Tests for Environmental Research.
Proceedings of the 24th Annual Meeting of the
Human Factors Society, Los Angeles, October,
1980.

Kennedy, R. S., Bittner, A. C., Jr., & Harbeson,
M. M. An engineering approach to the
standardization of Performance Evaluation Tests
for Environmental Research (PETER).
Proceedings of the 11th Annual Conference of the
Environmental Design Research Association,
Charleston, S.C., March, 1980.

Kennedy, R. S., Bittner, A. C., Jr., & Jones, M. B.
The utility of commercially available
television computer games for assessing
performance and other applications. Preprints of
the 51st Annual Scientific Meeting of the
Aerospace Medical Association, Anaheim, CA,
May 1980, 163-164.

McCafferty, D. B., Bittner, A. C., Jr., & Carter,
R. C. Performance Evaluation Tests for
Environmental Research (PETER): Auditory
digit span task. Proceedings of the 24th
Annual Meeting of the Human Factors Society,
Los Angeles, October, 1980.

Paivio, A., Yuille, J. C., & Madigan, S. A.
Concreteness, imagery, and meaningfulness
values for 925 nouns. Journal of Experimental
Psychology Monograph, 1968, 76, (1, Pt. 2).

Shannon, R. H. Task analytic approach to human
performance battery development. Proceedings
of the 24th Annual Meeting of the Human
Factors Society, Los Angeles, October, 1980.
(a)

Shannon, R. H. A factor analytic approach to
determining stability of human performance.
Proceedings of the 13th Annual Meeting of the
Human Factors Association of Canada, Point
Ideal, Ontario, Canada, September, 1980. (b)

Shepard, R. N., & Teghtsoonian, M. Retention of
information under conditions approaching a
steady state. Journal of Experimental
Psychology, 1961, 62, 302-309.

Thomas, D. J., Majewski, P. L., Ewing, C. L., &
Gilbert, M. S. Medical Qualification
Procedures for Hazardous-Duty Aeromedical
Research. (Conference Proceedings No. 231,
A3, pp. 1-13, 1978) London: AGARD, 1977.

Thorndike, E. L. & Lorge, I. The teacher's word
book of 30,000 words. New York: Teachers
College, Bureau of Publications, 1944.

Underwood, B. J., Boruch, R. F., & Malmi, R. A.
The composition of episodic memory. (ONR
Contract No. N00014-76-C-0270) Evanston,
Illinois: Northwestern University, May
1977. (NTIS No. AD A040696).


PROCEEDINGS  OF  THE  SEVENTH  PSYCHOLOGY  IN  THE  DOD  SYMPOSIUM 
USAF  ACADEMY,  COLORADO  SPRINGS,  CO  16-18  APRIL  1980 


Performance  Evaluation  Tests  for  Environmental  Research  (PETER): 

Interference Susceptibility Test (IST)

Michele Krause and Robert S. Kennedy
Naval  Aerospace  Medical  Research  Laboratory  Detachment 
New  Orleans,  Louisiana 

Abstract 

A program designed to develop Performance Evaluation Tests for
Environmental Research (PETER) is in progress. Underwood's (1977)
Interference Susceptibility Test (IST) was evaluated for inclusion in PETER on
the basis of its suitability for repeated administrations. Baseline
testing consisted of alternate forms of the IST being administered to 23
subjects for 15 workdays. The results show the mean of the total percent
correct score continues to exhibit a slow increase over the entire
experiment, with the standard deviation remaining constant subsequent to Day 7.
Reliability correlations appear differentially stable after some training
(r .75). The slope score, the traditional measure of IST, is unreliable,
although the standard deviations are relatively constant. The total
percent correct score is recommended for possible inclusion in PETER.


The Navy is developing Performance Evaluation Tests for Environmental
Research (PETER) at its medical laboratory in New Orleans. The goal
of the PETER program is to develop a multiple administration test battery
which will be effective in detecting performance decrements that are
caused by ship motion. Additionally, due to its nature, the test battery
is expected to lend itself to the study of other stressors, such as toxic
drugs, extreme temperatures and high pressure. The current phase of this
project involves repeated testings of cognitive, perceptual and
psychomotor tasks. In choosing a task for study, one or more of the following
criteria must have been met: (a) performance has been shown to be
disrupted in a thermal, inertial or hyperbaric environment, (b) it has
been acknowledged to assess cognitive, information-processing, or memory
functions, or (c) normal subjects have been distinguished from brain
damaged persons (Kennedy & Bittner, 1977). One of the tasks selected for
study was Underwood's Interference Susceptibility Test (IST) (Underwood,
Boruch & Malmi, 1977). This task was originally designed by Underwood to
study the effects of proactive interference. In this original study, 200
college students were tested on 24 separate tasks. Fernandes and Rose
(1978) included the test in their studies of an information-processing
approach to performance assessment. It is suspected that the more basic
memory tasks which have been studied at NAMRLD (e.g., recall and
recognition tasks) do not distinguish memory capacities in the same way as IST
does. The Interference Susceptibility Test required associations to be
formed, dismissed, and then new, conflicting associations formed during

The  opinions  are  those  of  the  authors  and  do  not  necessarily  reflect 
those  of  the  Department  of  the  Navy. 

This research was performed under Navy Work Unit No. MF58.524.002-5027.
The  authors  are  indebted  to  Andrew  Rose  for  providing  stimulus  material. 



exposure to persons suffering from motion sickness, one of the authors
found that "confusion" was reported as a frequent mental symptom. It is
possible that IST is sensitive enough to measure a component of
"confusion".

The purpose of the present study is to determine whether IST is
suitable for use in environmental research. From our point of view, a
task is considered suitable if it has task definition (i.e.,
differentiates between subjects) and is stable. In accordance with Jones (1979),
stability  exists  when:  (a)  the  daily  group  means  asymptote  or  evidence 
a  slight,  constant  slope,  (b)  day-to-day  variance  is  constant,  and  (c) 
relative  performance  standings  between  subjects  are  constant  from  day  to 
day.  A  recommendation  of  whether  to  include  this  test  in  subsequent 
PETER  studies  is  based  on  these  criteria.  Reviews  which  describe  this 
program  in  detail,  as  well  as  describe  the  results  of  previous  tasks  that 
have  been  administered,  are  available  (Harbeson,  Kennedy  &  Bittner,  1979; 
Kennedy  &  Bittner,  1977;  Kennedy,  Bittner,  &  Harbeson,  1980). 

Method 

Subjects 

The  subjects  were  a  group  of  23  volunteer  enlisted  Navy  men,  ages  19 
to  24.  To  qualify  for  this  medical  research  program,  they  had  to  be 
within  the  norms  for  Navy  enlisted  personnel  in  physical  health,  mental 
health  and  intelligence.  All  subjects  were  recruited,  evaluated  and 
employed  in  accordance  with  procedures  specified  in  Secretary  of  the  Navy 
Instruction  3900.39  and  Bureau  of  Medicine  and  Surgery  Instruction 
3900.6.  These  instructions  are  based  upon  voluntary  consent,  and  meet 
the  provisions  of  prevailing  national  and  international  guidelines.  A 
description of the subject selection procedure is given by Thomas,
Majewski, Ewing and Gilbert (1978).

Task  description 

Stimulus  material  for  each  session  was  comprised  of  lists  of  tri¬ 
gram-digit  pairs  (e.g.  NOB-2).  A  list  was  made  up  of  five  trigrams 
paired  with  digits  from  1  to  5.  During  each  session,  three  sets,  each 
containing  four  lists,  were  administered.  Across  the  four  lists  of  each 
set,  the  same  trigrams  were  paired  with  digits  from  1  to  5,  forming 
different  combinations  in  each  list.  Stimulus  material  was  provided  by 
Rose.  An  example  of  stimulus  material  for  one  set  is  found  in  Table  1. 


Apparatus  and  procedure. 

Subjects were shown each of five trigram-digit pairs by means of a
single slide, presented on a Kodak Ektagraph 450 AudioViewer. The rate
of presentation was one slide every 3 seconds. A cueing slide appeared
at the end of the list and at the beginning of the recall list. Each
trigram was then shown by itself (in an order different from the paired
presentation) for 4 seconds, and subjects recorded the number with which
they thought each trigram had been paired. Subjects were tested in
groups of four, at 8:00 in the morning, for 15 consecutive workdays.


Results


Two  measures  were  taken  across  sets  for  four  lists:  (a)  slope  of 
lists  and  (b)  percent  correct  for  each  list.  In  addition,  mean  percent 
correct  was  obtained  for  each  of  three  sets  (summed  over  lists)  and  an 
aggregate  mean  (over  sets  and  lists)  was  obtained  in  order  to  compare 
results  with  Underwood,  et  al.  (1977). 
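
The text does not state how the slope score was computed. One plausible reading, offered here only as an assumption, is an ordinary least-squares slope of percent correct on list position (Lists 1 to 4). The function name `list_slope` and the sample values are hypothetical.

```python
def list_slope(percent_correct):
    """Least-squares slope of percent correct on list position (1, 2, ...)."""
    n = len(percent_correct)
    xs = range(1, n + 1)
    mean_x = sum(xs) / n
    mean_y = sum(percent_correct) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, percent_correct))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Hypothetical subject: percent correct on Lists 1-4 of one session.
slope = list_slope([80, 70, 60, 50])
print(slope)  # -10.0 (a loss of 10 percentage points per list)
```

Under this reading, a more negative slope would indicate a steeper decline across successive lists, that is, greater susceptibility to interference.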

Figure  1  shows  the  mean  percent  correct  responses  across  sets  for 
the  four  lists.  As  expected,  performance  declines  with  each  successive 
list  that  is  presented.  The  impression  of  a  learning  curve  over  days  is 
observable  across  each  list.  The  greatest  improvement  is  seen  in  List  1 
(33%).  The  reason  for  the  anomalous  scores  on  Day  6  is  obscure.  Stan¬ 
dard  deviations,  as  seen  in  Figure  2,  are  level  and  unremarkable. 

Percent  correct  performance  for  each  of  the  three  sets  (summed  over 
lists)  showed  that  subjects  exhibit  a  slight  advantage  for  later  sets 
(not  shown),  although  the  differences  are  negligible.  Mean  performance 
for  the  three  sets,  across  lists  progresses  from  50.1  on  Day  1  to  71.8  on 
Day  15.  The  average  percent  correct  in  both  this  study  and  the 
Fernandes  &  Rose  (1978)  study  was  65%.  Underwood,  et  al.  (1977)  obtained 
an  85  percent  correct  average  when  this  test  was  interdigitated  with  23 
other  memory  tests. 

When Underwood et al. (1977) correlated total correct responses for
Sets 1, 3, 5 with those same scores from Sets 2, 4, 6, they obtained a
value of r = .81. This correlation between successive sets (i.e., split
half) in Underwood's study is compared to a correlation of r = .74
between successive days (i.e., test-retest) in the present research, wherein
the number of observations is the same for both calculations. There is
no evidence that the reliabilities of the present data are different from
those of Underwood et al. (1977) (z = .72, p > .40).
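
The comparison of the two reliabilities can be illustrated with the standard test for two independent correlations based on Fisher's r-to-z transformation. This is a sketch under stated assumptions: the text does not say which procedure or sample sizes were used, so the values of 200 and 23 subjects are taken from the study descriptions, and the function name is hypothetical. With these inputs the statistic comes out near 0.75, close to the reported z = .72, with a two-tailed p above .40.

```python
import math

def compare_correlations(r1, n1, r2, n2):
    """z test for two independent correlations via Fisher's transformation."""
    z1, z2 = math.atanh(r1), math.atanh(r2)
    se = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    z = (z1 - z2) / se
    # Two-tailed p from the standard normal distribution.
    p_two_tailed = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_two_tailed

# Assumed sample sizes: 200 (Underwood, et al.) and 23 (present study).
z, p = compare_correlations(0.81, 200, 0.74, 23)
print(round(z, 2), round(p, 2))
```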

Tables  2  and  3  show  reliabilities  within  Lists  2  and  4.  Because 
Lists  1  and  3  revealed  comparable  results,  they  are  not  shown.  These 
correlations  reveal  that  average  percent  correct  performance  appears  to 
stabilize  around  Day  8.  This  result  is,  perhaps,  more  clearly  illus¬ 
trated  when  Table  2  is  graphed  as  in  Figure  3.  This  figure  presents 
correlations  of  percent  correct  performance  for  selected  testing  days  in 
a  left-justified  manner,  enabling  examination  of  all  subsequent  testing 
days.  Although  a  progression  towards  stabilization  occurs,  the  task 
definition  remains  too  low  to  be  satisfactory  (Jones,  1979). 
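
A correlation trace of the kind plotted here pairs one base day with every later day. The sketch below shows one way to compute such a trace from a days-by-subjects score table. The data are hypothetical, days are 0-indexed, and the helper names are illustrative rather than taken from the original analysis.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_trace(scores_by_day, base_day):
    """Correlate the base day with every later day (days are 0-indexed)."""
    base = scores_by_day[base_day]
    return [pearson_r(base, later) for later in scores_by_day[base_day + 1:]]

# Hypothetical scores: 4 days x 5 subjects (not data from this study).
scores = [
    [50, 60, 55, 70, 65],
    [52, 58, 60, 72, 66],
    [55, 65, 60, 75, 70],
    [54, 66, 61, 74, 71],
]
trace = correlation_trace(scores, base_day=0)   # one r per later day
print([round(r, 2) for r in trace])
```

Plotting each trace from its own base day onward gives the left-justified layout described above: level, overlapping traces for later base days would indicate that relative standings between subjects have stabilized.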

Figure 4 shows the means and standard deviations for the slope
scores over lists. Mean slopes are variable and show no systematic
trend. The standard deviations are equal to the means, suggesting
substantial differences between subjects. Table 4 shows slope
reliabilities. Composite reliability for this score is essentially zero
(r = .04).


Discussion 


Percent correct scores for the individual lists provide evidence for
stabilization within the second week of testing, but with task definition
at too low a level to be considered useful. When the percent correct
scores are summed over lists and sets, task definition improves (r = .71),
and reliabilities after Day 8 appear stable. This aggregate score is the
one favored by Underwood, et al. (1977), who found it to be correlated
with the slope measure. While less defensible as a measure of
interference susceptibility, the percent correct score over lists and sets meets
the minimum requirements for suitability for PETER and will be employed
in subsequent analyses at this laboratory. It should be noted that the
test in its present form requires ten minutes to complete and yields a
composite reliability in List 2 (as an example) of r = .53. Using the
Spearman-Brown adjustment formula (Allen & Yen, 1979), reliability rises
to r = .69 if the testing length is doubled. The total aggregate score
improves from r = .71 to r = .83.
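
The adjusted values quoted above can be checked with the Spearman-Brown prophecy formula, r' = k r / (1 + (k - 1) r), where k is the factor by which test length is multiplied. The short sketch below applies it with k = 2; the function and variable names are illustrative.

```python
def spearman_brown(r, length_factor):
    """Predicted reliability when test length is multiplied by length_factor."""
    return length_factor * r / (1 + (length_factor - 1) * r)

list2_doubled = spearman_brown(0.53, 2)       # List 2 composite, doubled length
aggregate_doubled = spearman_brown(0.71, 2)   # total aggregate, doubled length

print(round(list2_doubled, 2), round(aggregate_doubled, 2))  # 0.69 0.83
```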

The chief finding in this experiment is that the slope score,
theoretically the most meaningful measure of the interference factor, is
unreliable (r = .04). This poor reliability over sessions is not due to
insufficient variance between subjects and it occurred despite the fact
that the slope means and standard deviations are stable. Fernandes and
Rose (1978) also obtained low reliability for the slope measure
(r = .05). It is probable that the same cautions which are associated with
difference scores (Cronbach & Furby, 1970) may apply to slopes. Those
authors suggest, as an alternative, analyzing the most complex condition
with the simplest condition as a covariate (in this case, List 4 with
List 1 as a covariate). This analysis will be performed on the IST data
at a later date.

In conclusion, IST, as analyzed up to this point, is not an ideal
candidate for inclusion in future PETER studies. It is recognized,
though, that with some modifications to the administration procedure, this
test may reveal a unique factor of memory that would be useful to include
in the final PETER battery. It may prove to be necessary, when studying
other environmental stressors (specifically impact acceleration), to
place heavier emphasis on memory tasks because of the close connection
between memory and other human systems and functions.

References 

Allen, M. J. & Yen, W. M. Introduction to Measurement Theory. Belmont,
California: Wadsworth, Inc., 1979.

Cronbach, L. J. & Furby, L. How should we measure "change" - or should
we? Psychological Bulletin, 1970, 74, 68-80.

Fernandes, K. & Rose, A. M. An Information Processing Approach to
Performance Assessment: II. An Investigation of Encoding and Retrieval
Processes in Memory. (Tech. Report IAR 58500-11/78 TR). Washington,
D.C.: American Institutes for Research, November, 1978.

Harbeson, M. M., Kennedy, R. S., & Bittner, A. C., Jr. A comparison of
the Stroop Test to other tasks for studies of environmental stress.
Proceedings of the 12th Annual Meeting of the Human Factors
Association of Canada, Bracebridge, Ontario, Canada, 6-8 September, 1979.

Jones, M. B. Stabilization and Task Definition in a Performance Test
Battery. (NAMRL Monograph No. 27). Pensacola, FL: U. S. Naval
Aerospace Medical Research Laboratory, 1980.

Kennedy, R. S. & Bittner, A. C., Jr. The development of a Navy
Performance Evaluation Test for Environmental Research (PETER). In
Productivity Enhancement: Personnel Performance Assessment in
Navy Systems. Symposium presented at the Naval Personnel Research
and Development Center, San Diego, CA, 12-14 October 1977. (NTIS
No. AD 056047).

Kennedy, R. S. & Bittner, A. C., Jr. Progress in the analysis of
Performance Evaluation Tests for Environmental Research (PETER).
Proceedings of the 22nd Annual Meeting of the Human Factors Society,
Detroit, Michigan, October, 1978. (NTIS No. AD A060676).

Kennedy, R. S., Bittner, A. C., Jr. & Harbeson, M. M. An engineering
approach to the standardization of Performance Evaluation Tests for
Environmental Research (PETER). Proceedings of the 11th Annual
Conference of the Environmental Design Research Association,
Charleston, S. C., March, 1980.

Thomas, D. J., Majewski, P. L., Ewing, C. L. & Gilbert, M. S. Medical
Qualification Procedures for Hazardous-Duty Aeromedical Research.
(Conference Proceedings No. 231, A3, pp. 1-13, 1978) London:
AGARD, 1977.

Underwood, B. J., Boruch, R. F. & Malmi, R. A. The composition of
episodic memory. (ONR Contract No. N00014-76-C-0270) Evanston,
Illinois: Northwestern University, May 1977. (NTIS No. AD A040696).

Tables 


Table  1 

Stimulus  Presentation 


DOC  -  \ 

«n«  - 

WIN  -  i 

pec  -  4 
Hew  -  i 

WIN  -  3 

nc  -  i 

DOC  -  * 

new  -  i 
no*  -  i 


Pairs shown to Ss


Correct response


Table  2 

Mean  Percent  Correct 
Reliabilities  for  List  2 


.32  .0*  .*2  .01  .0?  .*3  -.04  .13  .2> 

.31  .4«  .46  .JJ  .44  .30  .38  .41 

.62  .2#  .00  .33  .31  ,7T  .68 

.33  .  34  .TO  .41  .49  .TO 

.14  .IT  .11  .46  .33 

.33  .24  .31  .41 


2*  .03  .24 



PROCEEDINGS OF THE 24TH ANNUAL MEETING OF THE HUMAN FACTORS SOCIETY
LOS ANGELES, CA. 13-17 OCTOBER 1980


ITEM  RECOGNITION  AS  A  PERFORMANCE  EVALUATION  TEST  FOR  ENVIRONMENTAL  RESEARCH 

Robert  C.  Carter,  Robert  S.  Kennedy,  Alvah  C.  Bittner,  Jr.,  and  Michele  Krause 
Naval  Biodynamics  Laboratory,  New  Orleans,  LA  70189 

ABSTRACT 


Item Recognition (Sternberg, 1966) is a task which reflects the operation of human memory. This
task was considered as a candidate for use in a battery of Performance Evaluation Tests for
Environmental Research (PETER). Environmental research involves comparison of performances in a
baseline environment and in a novel environment. It is desirable that scores be stable at different
occasions in the baseline environment, so that changes due to the novel environment will be clear if
they occur. It was found that item recognition results were similar to those obtained by other
investigations, although the traditional item recognition score (slope) was unreliable across
repeated measurements. The response time (RT) was stable for each of the four memory set sizes
(1, 2, 3 & 4 items), from the standpoint of reliability, after the fourth session.

INTRODUCTION 

Sternberg's (1966, 1975) item recognition
task has recently been suggested for use as a
performance evaluation test (Rose, 1974). If a test
is to be used for environmental research, it must
be administered repeatedly, usually to the same
subjects, in a baseline condition and in the novel
environment. It would be desirable for a test to
provide unchanging scores in the baseline condition
because any change associated with repeated
measurement would be confounded with changes of
performance due to the environment. Therefore,
experiments are being conducted to determine whether
tasks yield stable scores which qualify them for
use as Performance Evaluation Tests for
Environmental Research (Kennedy, Bittner & Harbeson,
1979). Jones (1980) suggests that stability is
indicated when: (1) mean performance reaches
nearly constant slope over time, (2) between
subject variances are homogeneous over time, and
(3) relative performance standings of the subjects,
reflected in cross-session reliabilities, are
constant over time. The latter two of these
stability criteria, it is noteworthy, are
sufficient requirements for simple repeated measures
analysis of variance (Winer, 1971).

METHOD 

Subjects were 21 Navy enlisted males meeting
qualifications described by Thomas, Majewski,
Ewing and Gilbert (1978). Testing was conducted
once each day beginning on a Monday and continuing
for fifteen consecutive weekdays. The test sessions
lasted about 15 minutes per subject per day.

Subjects in this item recognition task were
presented with a series of one to four digits,
called the positive set, at a rate of 1 sec. per
item. All other digits constituted the
negative set. A probe digit followed presentation
of the positive set by 2 sec. The subject was to
select one of two responses depending upon whether
the probe was from the positive or negative set.
The duration from onset of the probe to the response
was recorded as the response time (RT). Each
session included ten trials for each positive set
size of 1, 2, 3 or 4 unique digits. Half of these
trials included probes from the positive set, and
half were from the negative set. Within these
restrictions the digits of the positive set and
the probe digits were chosen at random, and were
different on each day, but were the same for all
subjects on any particular day. Daily means and
standard deviations, and interday correlation
(reliability) matrices (all calculated across
subjects) were developed for each of the following
scores: mean RTs for positive set sizes 1,
2, 3, and 4; slope of mean RT versus set size;
intercept of mean RT versus set size; and percent
error. The slope and intercept scores for each
subject on each day were computed by least
squares regression. There was a regression
equation for each subject which expressed the 40
RTs for that subject on that day as a linear
function of positive set size. Slopes and
intercepts from these equations represented
individual differences, the reliabilities of
which were shown in intertrial correlation
matrices. Aggregate performance of all subjects
on each day was summarized by averaging the
subjects' slope or intercept scores.
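
The per-subject, per-day scoring described above can be sketched as follows. This is a minimal illustration and not the authors' program: the function name fit_line and the noise-free RT values (a 450 msec intercept and a 40 msec/item slope) are assumptions chosen only to make the arithmetic easy to follow.

```python
def fit_line(set_sizes, rts):
    """Ordinary least-squares fit of RT = intercept + slope * set_size."""
    n = len(rts)
    mean_x = sum(set_sizes) / n
    mean_y = sum(rts) / n
    # Sum of squares of set size and cross-products with RT.
    sxx = sum((x - mean_x) ** 2 for x in set_sizes)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(set_sizes, rts))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical subject-day: 10 trials at each positive set size (40 RTs, msec).
set_sizes = [size for size in (1, 2, 3, 4) for _ in range(10)]
rts = [450.0 + 40.0 * size for size in set_sizes]  # assumed, noise-free data

slope, intercept = fit_line(set_sizes, rts)
print(f"slope = {slope:.1f} msec/item, intercept = {intercept:.1f} msec")
```

The daily aggregate would then be the mean of these slope (or intercept) scores over the 21 subjects.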

Slope  and  intercept  scores  were  calculated 
based  on  Sternberg's  (1966)  finding  that  RT 
increased  linearly  with  positive  set  size.  This 
finding  has  since  been  confirmed  many  times 
(Sternberg,  1975).  The  slope  may  be  interpreted 
as  the  rate  of  search  through  short-term  memory 
and  the  intercept  is  interpreted  as  time  required 
for  stimulus  processing  and  response  formulation 
(cf.  Sternberg,  1966,  1975).  These  scores  have 
been found to reflect differences among individuals'
information processing capabilities
associated  with  age  (Anders,  Fozard,  &  Lillyquist, 
1972)  and  with  aphasia  (Swinney  &  Taylor,  1971). 

RESULTS  AND  DISCUSSION 

The present experiment differs from
Sternberg's (1966) in that he reports results
for "practiced" subjects while we show how the
results are affected by the degree of experience.
Our intercept score (450 msec) did not change
appreciably during the experiment (F(14,280) =
1.53, p > .1) and is comparable to that reported
by Sternberg (397.2 msec). However, our slope
scores (Figure 1) decreased with practice
(F(14,280) = 5.32, p < .005). This is a common
finding (Kristofferson, 1972; Ross, 1970; and
Simpson, 1972). Figure 1 indicates that the
slopes do not change very much after the third day
of testing. Our average slope score on the third
day (41.2 msec/item) is very similar to the average
slope obtained by Sternberg with practiced subjects
(37.9 msec/item). Our results contrast with
Sternberg's in that our subjects' error rate was
much greater than his (6% versus 1.3%); the error
rate did not change with practice (F(14,280) = .8,
p > .3).

Our main interest was to evaluate the use of
the slope and intercept scores as measures of
individual differences. Sternberg (1969) reported
individual differences of slopes, which he
conjectured to be related to different strategies of
memory scanning. We too obtained significant
individual differences of slopes (F(20,280) = 2.57,
p < .005) and intercepts (F(20,280) = 14.25,
p < .005). The cross-session reliabilities of
these slope and intercept scores indicate the
degree to which the scores represent enduring
abilities. Figure 2 illustrates selected
cross-session reliabilities of the slope scores. This
figure shows the extent to which subjects' scores
tended to remain in the same relationship to each
other from day to day. The complete set of
cross-session reliabilities for slopes is shown in
Table 1. The reliabilities are uniformly low, and
if they do stabilize, it is at a uselessly low
level. Similar results were obtained for the
intercept scores. The poor reliabilities cast
doubt upon the potential of these scores for
measurement of individual differences and they
would make the test relatively insensitive to
environmental effects.
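
A cross-session reliability of the kind discussed above is a correlation, computed across subjects, between the scores obtained on two different days. The sketch below shows one way such a matrix could be assembled; the helper names (pearson_r, reliability_matrix) and the four-subject scores are hypothetical and are not taken from the experiment.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def reliability_matrix(scores_by_day):
    """Upper triangle of day-by-day correlations, keyed by (day_i, day_j).

    scores_by_day[d][s] is the score of subject s on day d (0-based).
    """
    days = len(scores_by_day)
    return {
        (i, j): pearson_r(scores_by_day[i], scores_by_day[j])
        for i in range(days)
        for j in range(i + 1, days)
    }


# Hypothetical scores for four subjects on three days.
scores = [
    [10.0, 20.0, 30.0, 40.0],  # day 0
    [12.0, 22.0, 32.0, 42.0],  # day 1: same standings, shifted mean
    [40.0, 30.0, 20.0, 10.0],  # day 2: standings reversed
]
r = reliability_matrix(scores)
for pair in sorted(r):
    print(pair, round(r[pair], 2))
```

A shift in the daily mean leaves the correlation untouched, which is why reliability is treated separately from the stability of means in the analysis.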

In contrast, the reliabilities of the RTs
from which the slopes are calculated are relatively
high, being generally greater than r = .70.
Figure 3 shows cross-session reliabilities of RT
for positive set size 4 (RT4). (Similar results
were obtained for other positive set sizes.)
These reliabilities stabilize after Day 3 and are
substantial enough to differentiate individuals
(r = .80). The complete set of cross-session
reliabilities for the 4-item RTs is shown in Table 2.
Unfortunately, the RTs are not as meaningful as
the slopes and intercepts. For instance, the
slope is supposed to represent the rate of memory
scanning. But does it? Figure 4 shows the mean
reaction time to positive set sizes 1 through 4 on
each day of the experiment. If the rate model
were appropriate, then RT2-RT1 = RT3-RT2 = RT4-RT3.
Clearly this is not the case. The interval between
RT1 and RT2 is usually greater than any of the
others. Perhaps the slope is unreliable because
the rate it is supposed to represent is a fiction.
Numerous authors have found, as we did, that the
RT versus positive set size curve is nonlinear
(Simpson, 1972; Kristofferson, 1972; Swanson,
1974; Juola & Atkinson, 1971; Ross, 1970). In our
case, the nonlinearity cannot be explained as due
to a time-error tradeoff because error rate was
independent of positive set size (F(3,60) = .16,
p > .5). Fitting a line to such data adds a bias
(Draper & Smith, 1966) component to the error of
the fit. Reliability is the ratio of the true
variance to the sum of error plus true variance.
Inflation of the error by the bias would cause
the reliability ratio to collapse, as it did for
the slope scores in this experiment.
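
The two ideas in the preceding paragraph, equal RT increments under the rate model and reliability as a variance ratio, can be illustrated with a short sketch. All numbers below are invented for illustration (they are not the values plotted in Figure 4), and the function names are assumptions.

```python
def increments(mean_rts):
    """Successive differences RT2-RT1, RT3-RT2, RT4-RT3 of the set-size means."""
    return [later - earlier for earlier, later in zip(mean_rts, mean_rts[1:])]


def reliability(true_var, error_var):
    """Reliability as true variance over the sum of true and error variance."""
    return true_var / (true_var + error_var)


# Under the rate model every added item costs the same time.
linear = increments([490.0, 530.0, 570.0, 610.0])
# A hypothetical nonlinear pattern with the largest step between RT1 and RT2.
curved = increments([480.0, 550.0, 580.0, 600.0])
print(linear, curved)

# Hypothetical variances: lack-of-fit bias adds to the error term.
without_bias = reliability(100.0, 25.0)
with_bias = reliability(100.0, 25.0 + 375.0)
print(without_bias, with_bias)
```

With the true variance held fixed, any term added to the error variance can only lower the ratio, which is the sense in which a misfitted line would depress slope reliability.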

Even though the increase of RT with memory
load is not linear, it is still meaningful to
think of the increment of RT when positive set
size is increased. For instance, if RT4-RT1 were
different from one person to another or if it
were altered by a change in the environment, then
we could infer a difference in the amount of time
required to mentally compare the second, third
and fourth members of the positive set with the
probe. The estimate of RT4-RT1 could be improved
by accounting for the covariance of RT4 and RT1
(Cronbach & Furby, 1970). This refined estimate
of the time required for mental scanning could
come from an Analysis of Covariance of RT4, with
RT1 as the covariate.
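
As a rough sketch of the covariance adjustment suggested above, RT4 can be regressed on RT1 across subjects and the residuals taken as RT4 scores with the RT1 component removed. This shows only the adjustment step for a single group, not a full Analysis of Covariance, and the function name and the four-subject data are hypothetical.

```python
def adjust_for_covariate(rt4, rt1):
    """Residuals of RT4 after least-squares regression on RT1 across subjects."""
    n = len(rt4)
    mean_1 = sum(rt1) / n
    mean_4 = sum(rt4) / n
    s11 = sum((x - mean_1) ** 2 for x in rt1)
    s14 = sum((x - mean_1) * (y - mean_4) for x, y in zip(rt1, rt4))
    b = s14 / s11  # regression weight of RT4 on RT1
    return [y - (mean_4 + b * (x - mean_1)) for x, y in zip(rt1, rt4)]


# Hypothetical mean RTs (msec) for four subjects on one day.
rt1 = [400.0, 450.0, 500.0, 550.0]
rt4 = [520.0, 590.0, 610.0, 680.0]

adjusted = adjust_for_covariate(rt4, rt1)
print([round(value, 1) for value in adjusted])
```

The adjusted scores sum to zero and are uncorrelated with RT1, so differences among them reflect time beyond what RT1 already predicts.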


Sternberg's (1966) item recognition task has
been scrutinized as a candidate performance
evaluation test for environmental research.
Sternberg and others (cf. Sternberg, 1975) have
interpreted the slope of RT versus positive set
size to reflect the rate of memory scanning
during recognition. Our results are similar to
those of others who have studied this memory
scanning slope, except that we have calculated
cross-session reliabilities for repeated
measurements of subjects' memory scanning speed. The
reliabilities are vanishingly small, indicating
either that a person's memory scanning rate is
changeable (and hence, of little use as an
individual difference parameter), or that the slope
score is a poor way to represent memory scanning
rate. The latter interpretation is supported by
the finding that RT (especially for large positive
set size) is an extremely stable score which also
reflects memory scanning rate. RT for a large
positive set size, with RT1 as a covariate, is
recommended for further consideration as a
performance evaluation test which represents memory
scanning speed during environmental research.


REFERENCES 

Anders, T. R., Fozard, J. L., and Lillyquist, T.
D. Effects of age upon retrieval from
short-term memory. Developmental Psychology,
1972, 214-217.

Cronbach, L. J., and Furby, L. How we should
measure change-or should we? Psychological
Bulletin, 1970, 74, 68-80.

Draper, N. R., and Smith, H. Applied regression
analysis. New York: John Wiley, 1966.

Jones, B. Stabilization and task definition in
a performance test battery. (NBDL Monograph
No. M-0001) New Orleans, LA: Naval
Biodynamics Laboratory, 1980.

Juola, J. F., and Atkinson, R. C. Memory scanning
for words versus categories. Journal of
Verbal Learning and Verbal Behavior, 1971,
10, 522-527.

Kennedy, R. S., Bittner, Jr., A. C., and Harbeson,
M. M. An engineering approach to the
standardization of Performance Evaluation Tests for
Environmental Research (PETER). Proceedings
of the 11th Annual Conference of the
Environmental Design and Research Association (EDRA).
Charleston, SC, March, 1980.




Figure 1. Item Recognition Slope Means (X) and
Standard Deviations (S.D.) Over 15 Days (N=21).


Kristofferson, M. W. When item recognition and
visual search functions are similar.
Perception and Psychophysics, 1972, 12,
379-389.

Rose, A. M. Human information processing: An
assessment and research battery. Ann Arbor:
The University of Michigan, 1974.
(Technical Report No. 46)

Ross, J. Extended practice with a single character
classification task. Perception and
Psychophysics, 1970, 8, 276-278.

Sternberg, S. High speed scanning in human memory.
Science, 1966, 153, 652-654.

Sternberg, S. Memory scanning: Mental processes
revealed by reaction-time experiments.
American Scientist, 1969, 57, 421-457.

Sternberg, S. Memory scanning: New findings and
current controversies. Quarterly Journal of
Experimental Psychology, 1975, 27, 1-32.

Swanson, J. M. The neglected negative set.
Journal of Experimental Psychology, 1974,
103, 1019-1026.

Swinney, D. A., and Taylor, O. L. Short-term
memory recognition search in aphasics.
Journal of Speech and Hearing Research, 1971,
14, 578-588.

Thomas, D. J., Majewski, P. L., Ewing, C. L., and
Gilbert, N. S. Medical qualification
procedures for hazardous-duty aeromedical research.
London: AGARD, 1977 (Conference Proceedings
No. 231 A3 P. 1-13, 1978).

Winer, B. J. Statistical principles in
experimental design (2nd ed.). New York:
McGraw-Hill, 1971.


Figure 2. Item Recognition Slope Score: Intertrial
Correlations Between Selected Days (1, 2,
4, 6, 8, 10, & 12) and Following Days (N=21).


Figure 3. Item Recognition Time for Positive Set
Size Four: Intertrial Correlations Between
Selected Days (1, 2, 4, 6, 8, 10, & 12) and
Following Days (N=21).




TABLE 1

Item Recognition: Slope Reliabilities over 15 Days (n=21)

DAYS   2    3    4    5    6    7    8    9   10   11   12   13   14   15
   1 -13*  03  -09   16   21  -21   28   92   21   36  -22   21   15   24
   2       02  -10  -03   44   01   41  -38  -07   19   61   31   05   21
   3           -06   31   19  -17  -07   26   02   45   03   11  -43   11
   4                 39   23   54   01   12  -07   28  -20   05  -02  -02
   5                      31   02  -11   47   37   65  -32   37   04  -08
   6                          -37   34  -02   03   43   36   53   25   12
   7                               -07  -14  -05   04  -09  -24  -04  -07
   8                                     21  -19   03   40   58   21   28
   9                                          40   37  -56   25   09   03
  10                                               30  -57  -02   19   01
  11                                                   -19   33  -03   28
  12                                                         42  -05   01
  13                                                              14   17
  14                                                                  -09

TABLE 2

Item Recognition: Reliabilities for RT to positive set size
4 over 15 Days (n=21)

DAYS   2    3    4    5    6    7    8    9   10   11   12   13   14   15
   1  48*  56   60   57   57   59   67   44   50   57   54   60   59   56
   2       61   53   41   45   39   39   25   45   48   40   26   46   20
   3            78   82   73   73   72   64   76   84   82   63   66   65
   4                 85   78   77   75   69   80   91   81   71   66   76
   5                      84   84   87   87   91   85   92   86   87   88
   6                           83   80   87   80   71   83   81   78   68
   7                                79   82   7*   75   77   80   77   81
   8                                     85   90   76   86   86   86   80
   9                                          88   71   86   91   as   81
  10                                               83   88   88   91   81
  11                                                    90   74   69   86
  12                                                         83   81   86

* Decimal Points Omitted


